
On the Infimal Sub-differential Size of Primal-Dual Hybrid Gradient Method and Beyond

Haihao Lu Corresponding author, The University of Chicago, Booth School of Business (haihao.lu@chicagobooth.edu).    Jinwen Yang The University of Chicago, Department of Statistics (jinweny@uchicago.edu).
(March 2023)
Abstract

Primal-dual hybrid gradient method (PDHG, a.k.a. Chambolle and Pock method [10]) is a well-studied algorithm for minimax optimization problems with a bilinear interaction term. Recently, PDHG has been used as the base algorithm for a new LP solver PDLP that aims to solve large LP instances by taking advantage of modern computing resources, such as GPUs and distributed systems. Most of the previous convergence results for PDHG are stated either in terms of the duality gap or the distance to the optimal solution set, both of which are usually hard to compute during the solving process. In this paper, we propose a new progress metric for analyzing PDHG, which we dub infimal sub-differential size (IDS), by utilizing the geometry of PDHG iterates. IDS is a natural extension of the gradient norm of smooth problems to non-smooth problems, and it is tied to the KKT error in the case of LP. Compared to traditional progress metrics for PDHG, such as the duality gap and the distance to the optimal solution set, IDS always has a finite value and can be computed using only information of the current solution. We show that IDS monotonically decays, that it has an \mathcal{O}(\frac{1}{k}) sublinear rate for solving convex-concave primal-dual problems, and that it has a linear convergence rate if the problem further satisfies a regularity condition that holds for applications such as linear programming, quadratic programming, the TV-denoising model, etc. Furthermore, we present examples showing that the obtained convergence rates are tight for PDHG. The simplicity of our analysis and the monotonic decay of IDS suggest that IDS is a natural progress metric for analyzing PDHG. As a by-product of our analysis, we show that the primal-dual gap has an \mathcal{O}(\frac{1}{\sqrt{k}}) convergence rate for the last iterate of PDHG on convex-concave problems. The analysis and results on PDHG directly generalize to other primal-dual algorithms, for example, the proximal point method (PPM), the alternating direction method of multipliers (ADMM) and the linearized alternating direction method of multipliers (l-ADMM).

1 Introduction

Linear programming (LP) [5, 16] is a fundamental problem in mathematical optimization and operations research. Traditional LP solvers are essentially based on either the simplex method or the barrier method. However, it is highly challenging to further scale up these two methods: the computational bottleneck of both is solving linear equations, which does not scale well on modern computing resources, such as GPUs and distributed systems. Recently, a numerical study [2] demonstrated that an enhanced version of the Primal-Dual Hybrid Gradient (PDHG) method, called PDLP (the solver is open-sourced in Google OR-Tools), can reliably solve LP problems to high precision. As a first-order method, the computational bottleneck of PDHG is matrix-vector multiplication, which can efficiently leverage GPUs and distributed systems. This sheds light on solving huge-scale LPs whose data need to be stored in a distributed system.

This work is motivated by this recent line of research on using PDHG to solve large-scale LP. The convergence guarantees of PDHG, for generic convex-concave problems or for LP, have been extensively studied in the literature [66, 10, 11, 1, 4]. Most of the previous convergence analyses of PDHG are based on either the primal-dual gap or the distance to the optimal solution set. However, it can be highly non-trivial to compute these metrics while running the algorithm. For the primal-dual gap, the iterates when solving LP are often infeasible for the primal and/or the dual problem until the algorithm identifies an optimal solution; thus, the primal-dual gap at the iterates is likely infinite. The distance to the optimal solution set is not computable unless the optimal solution set is known. Furthermore, even if these values can be computed, they often oscillate dramatically in practice, which makes it hard to evaluate the solving progress. Motivated by these issues, the goal of this paper is to answer the following question:

Is there an evaluatable and monotonically decaying progress metric for analyzing PDHG?

We provide an affirmative answer to the above question by proposing a new progress metric for analyzing PDHG, which we dub infimal sub-differential size (IDS). IDS essentially measures the distance between 0 and the sub-differential set of the objective. It is a natural extension of the gradient norm to potentially non-smooth problems. Compared with other progress measurements for PDHG, such as the primal-dual gap and the distance to optimality, IDS always has a finite value and can be computed directly using only the information of the current solution, without knowledge of the optimal solution set. More importantly, IDS decays monotonically during the solving process, making it a natural progress measurement for PDHG. The design of IDS also takes into consideration the geometry of PDHG by using a norm that is natural for PDHG.

Furthermore, most previous analyses of PDHG [10, 11] focus on the ergodic rate, that is, the performance of the average PDHG iterate. On the other hand, it is frequently observed in practice that the last iterate of PDHG has comparable and sometimes superior performance to the average iterate [2]. Our analysis of IDS provides an explanation for this behavior. We show that the IDS of the last iterate of PDHG has an \mathcal{O}(1/k) convergence rate for convex-concave problems. As a direct by-product, the primal-dual gap at the last iterate has an \mathcal{O}(1/\sqrt{k}) convergence rate, which is inferior to the \mathcal{O}(1/k) convergence of the average iterate [11]. This explains why the average iterate can be a good option. Furthermore, we show that if the problem additionally satisfies metric sub-regularity, a condition met by many optimization problems, IDS and the primal-dual gap both enjoy linear convergence, whereas the average iterate can often have only sublinear convergence. This, together with other results under metric sub-regularity in the literature, explains the numerical observation that the last iterate may behave better.

More formally, we consider a generic minimax problem with a bilinear interaction term:

\min_{x\in\mathbb{R}^{n}}\max_{y\in\mathbb{R}^{m}}\mathcal{L}(x,y)=f(x)+\left\langle Ax,y\right\rangle-g(y) (1)

where f:\mathbb{R}^{n}\rightarrow(-\infty,+\infty] and g:\mathbb{R}^{m}\rightarrow(-\infty,+\infty] are simple proper lower semi-continuous convex functions and A\in\mathbb{R}^{m\times n} is a matrix. Note that the functions f(x) and g(y) can encode the constraints x\in\mathcal{X} and y\in\mathcal{Y} by using indicator functions. Throughout the paper, we assume that there is a finite optimal solution to the minimax problem (1). The primal-dual form of linear programming can be formulated as an instance of (1):

\min_{x\geq 0}\max_{y}\mathcal{L}(x,y)=c^{T}x+y^{T}Ax-b^{T}y\ .

Beyond LP, problems of form (1) are ubiquitous in statistics and machine learning. For example,

  • In image processing, the total variation (TV) denoising model [7] plays a central role in variational methods for imaging, and it can be reformulated as \min_{x\geq 0}\max_{y:\|y\|_{\infty}\leq 1}\mathcal{L}(x,y)=\frac{\lambda}{2}\|x-f\|_{2}^{2}+y^{T}\nabla x, where \nabla is the differential matrix.

  • In statistics, LASSO [57, 12] is a popular regression model, whose primal-dual formulation is \min_{x}\max_{y}\mathcal{L}(x,y)=\lambda\|x\|_{1}+y^{T}Ax-(\frac{1}{2}\|y\|_{2}^{2}+b^{T}y).

  • In machine learning, the support vector machine (SVM) [15, 6] is a classic supervised learning method for classification and can be formulated as \min_{x}\max_{-1/n\leq y_{i}/b_{i}\leq 0}\mathcal{L}(x,y)=\frac{\lambda}{2}\|x\|_{2}^{2}+y^{T}Ax-\frac{1}{n}\sum_{i=1}^{n}\frac{ny_{i}}{b_{i}}.

Primal-dual hybrid gradient method (PDHG, a.k.a. the Chambolle and Pock algorithm [10]), stated in Algorithm 1, is one of the most popular algorithms for solving the structured minimax problem (1). It has been extensively used in the field of image processing: total-variation image denoising [7], multimodal medical imaging [38], computation of nonlinear eigenfunctions [28], and many others. More recently, PDHG also serves as the base algorithm for a new large-scale LP solver PDLP [2, 3, 4].

Algorithm 1 Primal Dual Hybrid Gradient (PDHG). (PDHG is often presented in a form with different primal and dual step sizes [10, 11]. Here, we choose to use the same primal and dual step size for notational convenience. Our results can easily extend to the case of different step sizes by rescaling the variables.)
0:  Initial point (x_{0},y_{0}), step-size s>0.
1:  for k=0,1,... do
2:     x_{k+1}=\text{prox}_{f}^{s}(x_{k}-sA^{T}y_{k})
3:     y_{k+1}=\text{prox}_{g}^{s}(y_{k}+sA(2x_{k+1}-x_{k}))
4:  end for
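
To make the updates concrete, the following NumPy sketch (our own illustration, not part of the paper) instantiates Algorithm 1 for the primal-dual LP form \min_{x\geq 0}\max_{y}c^{T}x+y^{T}Ax-b^{T}y, where \text{prox}_{f}^{s} reduces to a shifted projection onto the nonnegative orthant and \text{prox}_{g}^{s} to a shift by sb; the random instance and the step-size rule s=0.9/\|A\| are illustrative assumptions.

```python
import numpy as np

def pdhg_lp(A, b, c, iters=5000, s=None):
    """Sketch of Algorithm 1 for the LP saddle point
    min_{x >= 0} max_y  c^T x + y^T A x - b^T y,
    where prox_f^s(v) = max(v - s*c, 0) and prox_g^s(v) = v - s*b."""
    m, n = A.shape
    if s is None:
        s = 0.9 / np.linalg.norm(A, 2)                  # step size s < 1/||A||
    x, y = np.zeros(n), np.zeros(m)
    for _ in range(iters):
        x_new = np.maximum(x - s * (A.T @ y + c), 0.0)  # primal prox step
        y = y + s * (A @ (2 * x_new - x) - b)           # extrapolated dual prox step
        x = x_new
    return x, y

# illustrative random instance (not from the paper)
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 8))
b = A @ rng.random(8)          # choose b so the primal is feasible
c = rng.random(8)
x, y = pdhg_lp(A, b, c)
print("primal infeasibility ||Ax - b||:", np.linalg.norm(A @ x - b))
```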

The contribution of the paper can be summarized as follows:

  • We propose to identify the “right” progress metric when studying an optimization algorithm. In particular, we propose a new progress metric, IDS, for analyzing PDHG, and show that it monotonically decays (see Section 2).

  • We show that IDS converges to 0 with an \mathcal{O}(1/k) convergence rate when solving a convex-concave minimax problem (1) (Theorem 1). As a by-product, this shows an \mathcal{O}(1/\sqrt{k}) non-ergodic convergence rate of the duality gap for PDHG when solving convex-concave minimax problems.

  • We show that IDS converges to 0 with a linear convergence rate if the problem satisfies metric sub-regularity, which is satisfied by many applications.

  • We extend the above results of PDHG to other classic algorithms for minimax problems, including proximal point method, alternating direction method of multipliers and linearized alternating direction method of multipliers.

  • The proofs of the above results are surprisingly simple, which is in contrast to the recent literature on the last iterate convergence for cousin algorithms.

1.1 Related literature

Convex-concave minimax problems. There have been several lines of research that develop algorithms for convex-concave minimax problems and general monotone variational inequalities. The two most classic algorithms may be the proximal point method (PPM) and the extragradient method (EGM). Rockafellar introduces PPM for monotone variational inequalities [55] and, around the same time, Korpelevich proposes EGM for solving the convex-concave minimax problem [40]. Convergence analysis of PPM and EGM has been flourishing since then. Tseng proves the linear convergence of PPM and EGM on strongly-convex-strongly-concave minimax problems and on unconstrained bilinear problems [58]. Nemirovski shows that EGM, as a special case of the mirror-prox algorithm, exhibits an \mathcal{O}(\frac{1}{\epsilon}) sublinear rate for solving general convex-concave minimax problems over a compact and bounded set [51]. Moreover, the connection between PPM and EGM is strengthened in [51], i.e., EGM approximates PPM. Another line of research investigates minimax problems with a bilinear interaction term (1). For such problems, the two most popular algorithms are perhaps PDHG [10] and Douglas-Rachford splitting (DRS) [22, 24]. Recently, motivated by applications in statistics and machine learning, there has been renewed interest in minimax problems. In particular, for the bilinear problem (f(x)=g(y)=0 in (1)), [17, 50] show that Optimistic Gradient Descent Ascent (OGDA) enjoys a linear convergence rate when A has full rank, and [50] proves that EGM and PPM also converge linearly under the same condition. Besides, [50] builds up the connection between OGDA and PPM: OGDA is an approximation to PPM for bilinear problems. Under an ODE framework, [46] investigates the dynamics of unconstrained primal-dual methods and yields tight conditions under which different algorithms exhibit linear convergence.

Primal-dual hybrid gradient method (PDHG). The early works on PDHG were motivated by applications in image processing and computer vision [66, 14, 25, 34, 10]. Recently, PDHG has served as the base algorithm of a large-scale LP solver [2, 3, 4]. The first convergence guarantee of PDHG, for the average iterate, was given in [10]. Later, [11] presented a simplified and unified analysis of the ergodic rate of PDHG. More recently, many variants of PDHG have been proposed, including adaptive versions [59, 49, 54, 29] and stochastic versions [9, 1, 47]. PDHG was also shown to be equivalent to DRS [53, 45].

Ergodic rate vs non-ergodic rate. In the literature on convex-concave minimax problems, convergence is often described in the ergodic sense, i.e., on the average of the iterates. For example, Nemirovski [51] shows that EGM (or more generally the mirror-prox algorithm) has an \mathcal{O}(1/k) primal-dual gap for the average iterate; Chambolle and Pock [10, 11] show that PDHG has an \mathcal{O}(1/k) primal-dual gap for the average iterate. More recently, the convergence rate of the last iterate (i.e., the non-ergodic rate) has attracted much attention in the machine learning and optimization communities due to its practical usage. Most of the works on the last iterate study the gradient norm when the problem is differentiable. For example, [30, 31] show that the squared gradient norm of EGM iterates converges at an \mathcal{O}(1/k) sublinear rate. [61] proposes an extra anchored gradient descent algorithm that converges at a faster \mathcal{O}(1/k^{2}) rate when the objective function is smooth. The recent works [8, 32] study the last-iterate convergence of EGM and OGDA in the constrained setting. More specifically, [8] utilizes sum-of-squares (SOS) programming to analyze the tangent residual, and [32] uses semi-definite programming (SDP) and Performance Estimation Problems (PEPs) to derive last-iterate bounds on a complicated potential function for the two algorithms. The tangent residual can be viewed as a special case of IDS when f and g are the sum of a smooth function and an indicator function, whereas IDS can handle general non-smooth terms.

Another related line of work is the study of the squared distance between successive iterates for splitting schemes [18, 19]. Although there are connections between IDS and the distance between successive iterates, the geometric interpretations are quite different: IDS is a direct quality metric that characterizes the size of the sub-differential at the current iterate, while the distance between successive iterates measures the fixed-point residual, which is an indirect characterization of the quality of the solution. In terms of the primal-dual gap, [30] shows that the last iterate of EGM has an \mathcal{O}(1/\sqrt{k}) rate, which is slower than the average iterate of EGM, and [18, 27] show an \mathcal{O}(1/\sqrt{k}) rate of objective errors for some primal-dual methods. As a by-product of our analysis, we show that the last iterate of PDHG also has an \mathcal{O}(1/\sqrt{k}) rate in the primal-dual gap.

On the other hand, linear convergence of primal-dual algorithms is often observed in the last iterate. To obtain a linear convergence rate, additional conditions are required. There are three different types of conditions studied in the literature: strong-convexity-concavity [58] (i.e., strong monotonicity), interaction dominance/negative comonotonicity [33, 43], and metric sub-regularity [42, 44, 1, 26]. The metric used in the above analyses is usually the gradient norm for differentiable problems or the distance to optimality. The linear convergence under additional assumptions partially explains why practitioners may choose to favor the last iterate over the average iterate.

Metric sub-regularity. The notion of metric sub-regularity is extensively studied in the community of variational analysis [36, 37, 20, 41, 63, 64, 65, 21]. It was first introduced in [36] and later the terminology “metric sub-regularity” was suggested in [20], which is also closely related to calmness [35]. For more detail in this direction, see [21].

The regularity condition holds for many important applications. For example, [65, 42] show that problems with piecewise linear-quadratic functions on a compact set satisfy the regularity condition, including LASSO, support vector machines, etc. Partially for this reason, there has recently been extensive interest in analyzing first-order methods under the assumption of metric sub-regularity. For convex minimization, [23] shows that metric sub-regularity of the sub-gradient is equivalent to the quadratic error bound. The results for primal-dual algorithms are also fruitful [42, 44, 1, 26, 48]. In the context of PDHG, under metric sub-regularity, [42, 26] prove the linear rate of deterministic PDHG and [1] shows the linear convergence of stochastic PDHG in terms of the distance to the optimal set.

1.2 Notations

We use \|\cdot\|_{2} to denote the Euclidean norm and \left\langle\cdot,\cdot\right\rangle its associated inner product. For a positive definite matrix M\succ 0, we denote \left\langle\cdot,\cdot\right\rangle_{M}=\left\langle\cdot,M\cdot\right\rangle and let \|\cdot\|_{M} be the norm induced by the inner product \left\langle\cdot,\cdot\right\rangle_{M}. Let dist_{M}(z,\mathcal{Z}) be the distance between a point z and a set \mathcal{Z} under the norm \|\cdot\|_{M}, that is, dist_{M}(z,\mathcal{Z})=\min_{u\in\mathcal{Z}}\|z-u\|_{M}. Denote \mathcal{Z}^{*} the optimal solution set to (1). Let \sigma_{min}^{+}(A) be the minimum nonzero singular value of a matrix A, and \|A\| the operator norm of A. Let \iota_{\mathcal{C}}(\cdot) be the indicator function of the set \mathcal{C}. We write f(x)=\mathcal{O}(g(x)) if, for sufficiently large x, there exists a constant C such that f(x)\leq Cg(x), and f(x)=\Omega(g(x)) if, for sufficiently large x, there exists a constant c such that f(x)\geq cg(x). The proximal operator is defined as \text{prox}_{h}^{\tau}(x):=\text{argmin}_{y\in\mathbb{R}^{n}}\left\{h(y)+\frac{1}{2\tau}\|y-x\|^{2}\right\}. Denote z=(x,y) and \mathcal{F}(z)=\mathcal{F}(x,y)=\begin{pmatrix}\partial f(x)+A^{T}y\\ -Ax+\partial g(y)\end{pmatrix} the sub-differential of the objective. For a matrix A, denote A^{-} a pseudo-inverse of A and \mathrm{range}(A) the range of A.
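
As a concrete illustration of the proximal operator defined above, here is a small NumPy sketch (our own hedged illustration rather than code from the paper) of two prox maps that recur in the examples of this paper: the \ell_{1} norm and the indicator of the nonnegative orthant.

```python
import numpy as np

def prox_l1(x, tau):
    """prox of tau * ||.||_1: soft-thresholding, argmin_y ||y||_1 + ||y - x||^2 / (2*tau)."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def prox_nonneg_indicator(x, tau):
    """prox of the indicator of R^n_+: projection onto the nonnegative orthant (tau plays no role)."""
    return np.maximum(x, 0.0)
```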

2 Infimal sub-differential size (IDS)

In this section, we introduce a new progress measurement, which we dub infimal sub-differential size (IDS), for PDHG. IDS is a natural extension of the squared gradient norm for smooth optimization problems to non-smooth optimization problems. Compared to other progress measurements for PDHG, such as the primal-dual gap and the distance to optimality, IDS always has a finite value and is computable directly without the need to know the optimal solution set. More importantly, unlike other metrics that may oscillate over time, IDS monotonically decays along the iterates of PDHG, further suggesting that IDS is a natural progress measurement for PDHG. Here is the formal definition of IDS:

Definition 1.

The infimal sub-differential size (IDS) at solution z for PDHG with step-size s is defined as:

dist_{P_{s}^{-1}}^{2}(0,\mathcal{F}(z))\ , (2)

where P_{s}=\begin{pmatrix}\frac{1}{s}I_{n}&-A^{T}\\ -A&\frac{1}{s}I_{m}\end{pmatrix} defines the PDHG norm and \mathcal{F}(z)=\begin{pmatrix}\partial f(x)+A^{T}y\\ -Ax+\partial g(y)\end{pmatrix} is the sub-differential of the objective \mathcal{L}(x,y).

Note that the first-order optimality condition of (1) is 0\in\mathcal{F}(z). In other words, if the original minimax problem (1) is feasible and bounded, the minimization problem

\min_{z\in\mathbb{R}^{m+n}}dist_{P_{s}^{-1}}^{2}(0,\mathcal{F}(z)) (3)

has optimal value 0, and furthermore (3) and (1) share the same optimal solution set; thus IDS is a valid progress measurement.

Figure 1: The trajectory of PDHG (panel a) and PPM (panel b) for solving the simple unconstrained bilinear problem \min_{x}\max_{y}xy.

When measuring the distance between 0 and the set \mathcal{F}(z), a crucial part is the choice of the norm. The simplest norm, which is also used in practice, is the \ell_{2} norm. However, this may not be the natural norm for all algorithms. To understand the intuition, Figure 1 plots the trajectories of PDHG and the proximal point method (PPM) for solving the simple unconstrained bilinear problem

\min_{x\in\mathbb{R}}\max_{y\in\mathbb{R}}\mathcal{L}(x,y):=xy\ .

We can observe that the iterates of the two algorithms exhibit spiral patterns toward the unique saddle point (0,0). Nevertheless, in contrast to the circular pattern of PPM, the spiral of PDHG spins in a tilted elliptical manner. The fundamental reason for this elliptical structure is the asymmetry of the primal and dual steps in PDHG. The elliptical structure of the iterates can be captured by the choice of norm in the analysis. This is consistent with the fact that while the analysis of PPM [55] utilizes the \ell_{2} norm, the analysis of PDHG [11, 4, 34] often utilizes a different norm defined by P_{s}=\begin{pmatrix}\frac{1}{s}I_{n}&-A^{T}\\ -A&\frac{1}{s}I_{m}\end{pmatrix}. The matrix P_{s} has already been used in analyzing PDHG; for example, [34] studies the contraction of PDHG under the P_{s} norm and motivates variants of PDHG from this perspective. We here propose to use P_{s}^{-1} in the sub-differential (dual) space.
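
The spiral behavior in Figure 1 can be reproduced with a few lines of NumPy; the sketch below (our own illustration; the step-size s=0.2 and the starting point are arbitrary assumptions) runs both methods on \min_{x}\max_{y}xy and prints the quantity that each method contracts: the Euclidean norm for PPM and the P_{s}-norm for PDHG.

```python
import numpy as np

s = 0.2                     # step size (assumption; any s < 1/||A|| = 1 works)
z0 = np.array([1.0, 1.0])   # starting point (assumption)

def pdhg_traj(z0, s, iters=200):
    # PDHG on min_x max_y xy:  x+ = x - s*y,  y+ = y + s*(2*x+ - x)
    traj, (x, y) = [z0.copy()], z0
    for _ in range(iters):
        x_new = x - s * y
        y = y + s * (2 * x_new - x)
        x = x_new
        traj.append(np.array([x, y]))
    return np.array(traj)

def ppm_traj(z0, s, iters=200):
    # PPM: (I + s*J) z_{k+1} = z_k with J = [[0, 1], [-1, 0]]
    M = np.linalg.inv(np.eye(2) + s * np.array([[0.0, 1.0], [-1.0, 0.0]]))
    traj, z = [z0.copy()], z0.copy()
    for _ in range(iters):
        z = M @ z
        traj.append(z.copy())
    return np.array(traj)

pdhg, ppm = pdhg_traj(z0, s), ppm_traj(z0, s)
# PPM shrinks the Euclidean norm every step; PDHG shrinks the P_s-norm instead.
Ps = np.array([[1.0 / s, -1.0], [-1.0, 1.0 / s]])
print("PPM  ||z||_2  (first 3):", [round(np.linalg.norm(z), 4) for z in ppm[:3]])
print("PDHG ||z||_Ps (first 3):", [round(float(z @ Ps @ z) ** 0.5, 4) for z in pdhg[:3]])
```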

In previous works, the convergence guarantee of PDHG is often stated with respect to the duality gap [10, 11], \max_{\hat{x}\in\mathcal{X},\hat{y}\in\mathcal{Y}}\mathcal{L}(x,\hat{y})-\mathcal{L}(\hat{x},y), or the distance to the optimal solution set [4, 26], dist(z,\mathcal{Z}^{*}). Unfortunately, these values are barely computable or informative because the iterates are usually infeasible until the end of the algorithm (thus, the duality gap at the iterates is often infinite) and the optimal solution set is unknown to the user beforehand. In practice, people instead use residuals as performance metrics because of their computational efficiency. For example, the KKT residual, i.e., the violation of primal feasibility, dual feasibility, and complementary slackness, is widely used in LP solvers as the performance metric. Indeed, the KKT residual of LP is upper bounded by IDS up to a constant. More formally, consider the primal-dual form of standard LP

\min_{x\geq 0}\max_{y}\;c^{T}x-y^{T}Ax+b^{T}y\ ,

and the corresponding KKT residual for z=(x,y), x\geq 0 is given by \left\|\begin{pmatrix}Ax-b\\ [A^{T}y-c]^{+}\\ [c^{T}x-b^{T}y]^{+}\end{pmatrix}\right\|_{2}.
Then it holds for any z=(x,y), x\geq 0 that

dist^{2}(0,\mathcal{F}(z))\geq\rho_{r}^{2}(z)\geq\frac{1}{1+4R^{2}}\left\|\begin{pmatrix}Ax-b\\ [A^{T}y-c]^{+}\\ [c^{T}x-b^{T}y]^{+}\end{pmatrix}\right\|_{2}^{2}\ ,

where R is an upper bound on the norm of the iterates z_{k} and \rho_{r}(z) is the normalized duality gap defined in [4, Equation (4a)]. The first inequality follows from [4, Proposition 5] and the second inequality uses [4, Lemma 4]. Therefore, IDS under the \ell_{2} norm is an upper bound on the KKT residual for LP. Since all norms in a finite-dimensional Hilbert space are equivalent (up to a constant), this showcases that IDS under the P_{s} norm also provides an upper bound on the KKT residual.
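
For reference, here is a minimal NumPy sketch (our own illustration; the helper name and inputs are assumptions) of the KKT residual displayed above for the standard-form LP.

```python
import numpy as np

def kkt_residual(A, b, c, x, y):
    """KKT residual for min_{x>=0} max_y c^T x - y^T A x + b^T y at z = (x, y) with x >= 0:
    primal residual Ax - b, dual residual [A^T y - c]^+, and gap residual [c^T x - b^T y]^+."""
    r_primal = A @ x - b
    r_dual = np.maximum(A.T @ y - c, 0.0)
    r_gap = max(c @ x - b @ y, 0.0)
    return np.linalg.norm(np.concatenate([r_primal, r_dual, [r_gap]]))
```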

Since \mathcal{F}(z) lives in the dual space, we utilize the dual norm induced by P_{s}^{-1} in the definition of IDS. In fact, for the simple bilinear problem in Figure 1, the use of the P_{s}^{-1} norm is the key to guaranteeing the monotonic decay of IDS for PDHG iterates. Indeed, this monotonicity result holds for any convex-concave problem and is formalized in the proposition below:

Proposition 1.

Consider PDHG (Algorithm 1) with step-size s<\frac{1}{\|A\|} for solving a convex-concave minimax problem (1). It holds for k\geq 0 that

dist_{P_{s}^{-1}}^{2}(0,\mathcal{F}(z_{k+1}))\leq dist_{P_{s}^{-1}}^{2}(0,\mathcal{F}(z_{k}))\ .
Proof.

First, notice that the iterate update of Algorithm 1 can be written as

P_{s}(z_{k}-z_{k+1})\in\mathcal{F}(z_{k+1})\ . (4)

Since PDHG is a forward-backward algorithm [13], we can rewrite the proximal steps in Algorithm 1 as follows:

\displaystyle\begin{split}&\ 0\in\partial f(x_{k+1})+\frac{1}{s}(x_{k+1}-x_{k}+sA^{T}y_{k})\\ &\ 0\in\partial g(y_{k+1})+\frac{1}{s}(y_{k+1}-y_{k}-sA(2x_{k+1}-x_{k}))\ .\end{split} (5)

We arrive at (4) by rearranging (5):

\displaystyle\begin{split}P_{s}(z_{k}-z_{k+1})&\ =\begin{pmatrix}\frac{1}{s}I_{n}&-A^{T}\\ -A&\frac{1}{s}I_{m}\end{pmatrix}\begin{pmatrix}x_{k}-x_{k+1}\\ y_{k}-y_{k+1}\end{pmatrix}=\begin{pmatrix}\frac{1}{s}(x_{k}-x_{k+1})-A^{T}(y_{k}-y_{k+1})\\ -A(x_{k}-x_{k+1})+\frac{1}{s}(y_{k}-y_{k+1})\end{pmatrix}\\ &\ \in\begin{pmatrix}\partial f(x_{k+1})+A^{T}y_{k+1}\\ -Ax_{k+1}+\partial g(y_{k+1})\end{pmatrix}=\mathcal{F}(z_{k+1})\ .\end{split}

Now denote \widetilde{\omega}_{k+1}=P_{s}(z_{k}-z_{k+1})\in\mathcal{F}(z_{k+1}). Then it holds for any \omega_{k}\in\mathcal{F}(z_{k}) that

\displaystyle\begin{split}\|\widetilde{\omega}_{k+1}\|_{P_{s}^{-1}}^{2}-\|\omega_{k}\|_{P_{s}^{-1}}^{2}&=\left\langle\widetilde{\omega}_{k+1}-\omega_{k},P_{s}^{-1}(\widetilde{\omega}_{k+1}+\omega_{k})\right\rangle\\ &\leq\left\langle\widetilde{\omega}_{k+1}-\omega_{k},2(z_{k+1}-z_{k})+P_{s}^{-1}(\widetilde{\omega}_{k+1}+\omega_{k})\right\rangle\\ &=\left\langle P_{s}(z_{k}-z_{k+1})-\omega_{k},2(z_{k+1}-z_{k})+P_{s}^{-1}(P_{s}(z_{k}-z_{k+1})+\omega_{k})\right\rangle\\ &=\left\langle P_{s}(z_{k}-z_{k+1})-\omega_{k},-(z_{k}-z_{k+1})+P_{s}^{-1}\omega_{k}\right\rangle\\ &=-\|z_{k}-z_{k+1}\|_{P_{s}}^{2}+2\omega_{k}^{T}(z_{k}-z_{k+1})-\|\omega_{k}\|_{P_{s}^{-1}}^{2}\\ &=-\|P_{s}(z_{k}-z_{k+1})-\omega_{k}\|_{P_{s}^{-1}}^{2}\\ &=-\|\widetilde{\omega}_{k+1}-\omega_{k}\|_{P_{s}^{-1}}^{2}\\ &\leq 0\ ,\end{split}

where the inequality utilizes that \mathcal{F} is a monotone operator due to the convexity-concavity of the objective \mathcal{L}(x,y), thus \left\langle\widetilde{\omega}_{k+1}-\omega_{k},z_{k+1}-z_{k}\right\rangle\geq 0, and the second equality uses \widetilde{\omega}_{k+1}=P_{s}(z_{k}-z_{k+1}). Now choose \omega_{k}=\arg\min_{w\in\mathcal{F}(z_{k})}\{\|w\|_{P_{s}^{-1}}^{2}\}, and we obtain

dist_{P_{s}^{-1}}^{2}(0,\mathcal{F}(z_{k+1}))\leq\|\widetilde{\omega}_{k+1}\|_{P_{s}^{-1}}^{2}\leq\|\omega_{k}\|_{P_{s}^{-1}}^{2}=dist_{P_{s}^{-1}}^{2}(0,\mathcal{F}(z_{k}))\ ,

which finishes the proof. ∎

We end this section by making two remarks on the proof of Proposition 1:

Remark 1.

The proof of Proposition 1 not only shows the monotonicity of IDS, but also provides a bound on its decay:

dist_{P_{s}^{-1}}^{2}(0,\mathcal{F}(z_{k+1}))\leq dist_{P_{s}^{-1}}^{2}(0,\mathcal{F}(z_{k}))-\|\widetilde{\omega}_{k+1}-\omega_{k}\|_{P_{s}^{-1}}^{2}\ , (6)

where \omega_{k}=\arg\min_{w\in\mathcal{F}(z_{k})}\{\|w\|_{P_{s}^{-1}}^{2}\}\in\mathcal{F}(z_{k}) and \widetilde{\omega}_{k+1}=P_{s}(z_{k}-z_{k+1})\in\mathcal{F}(z_{k+1}).
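
The monotone decay in (6) is easy to verify numerically. The sketch below (our own illustration with an arbitrary step-size) checks it on the bilinear toy problem of Figure 1, where \mathcal{F} is single-valued so IDS is simply \|\mathcal{F}(z)\|_{P_{s}^{-1}}^{2}.

```python
import numpy as np

# Check of Proposition 1 on min_x max_y xy, where f = g = 0,
# so F(z) = (y, -x) is single-valued and IDS = ||F(z)||^2_{P_s^{-1}}.
s = 0.2                                        # any s < 1/||A|| = 1 (assumption)
Ps = np.array([[1.0 / s, -1.0], [-1.0, 1.0 / s]])
Ps_inv = np.linalg.inv(Ps)
F = lambda z: np.array([z[1], -z[0]])
ids = lambda z: F(z) @ Ps_inv @ F(z)

x, y = 1.0, 1.0
vals = [ids(np.array([x, y]))]
for _ in range(50):
    x_new = x - s * y                          # PDHG updates with f = g = 0
    y = y + s * (2 * x_new - x)
    x = x_new
    vals.append(ids(np.array([x, y])))
print("IDS is non-increasing:", all(a >= b for a, b in zip(vals, vals[1:])))
```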

Remark 2.

In the proof of Proposition 1 and the convergence proofs in Section 3 and Section 4, the only information we use about PDHG is the update rule (4) and the fact that P_{s} is a positive definite matrix. This makes it easy to generalize the results for PDHG to other algorithms. For example, PPM for solving the convex-concave minimax problem \min_{x}\max_{y}\mathcal{L}(x,y) has the following update rule:

\frac{1}{s}(z_{k}-z_{k+1})\in\mathcal{F}(z_{k+1})\ ,

and this iterate update is also a special case of (4) with P_{s}=\frac{1}{s}I and \mathcal{F}(z)=\begin{pmatrix}\partial_{x}\mathcal{L}(x,y)\\ -\partial_{y}\mathcal{L}(x,y)\end{pmatrix}. Thus, all of the analysis and results we develop herein for PDHG extend directly to PPM in parallel.

3 Sublinear convergence of IDS

In this section, we present the sublinear convergence rate of IDS. Our major result is stated in Theorem 1. As a direct consequence, the result implies a sublinear rate for the non-ergodic (last) iterate on the duality gap.

Theorem 1.

Consider the iterates \{z_{k}\}_{k=0}^{\infty} of PDHG (Algorithm 1) for solving a convex-concave minimax problem (1). Let z_{*}\in\mathcal{Z}^{*} be an optimal point of (1). Suppose the step-size satisfies s<\frac{1}{\|A\|}. Then, it holds for any iteration k\geq 1 that

dist_{P_{s}^{-1}}^{2}(0,\mathcal{F}(z_{k}))\leq\frac{1}{k}\|z_{0}-z_{*}\|_{P_{s}}^{2}\ .
Proof.

Recall the update of PDHG (4). Then we have for any i that

\widetilde{\omega}_{i+1}:=P_{s}(z_{i}-z_{i+1})\in\mathcal{F}(z_{i+1})\ .

It holds for any iteration i\geq 0 that

\displaystyle\begin{split}\|z_{i}-z_{*}\|_{P_{s}}^{2}&\ =\|z_{i}-z_{i+1}+z_{i+1}-z_{*}\|_{P_{s}}^{2}\\ &\ =\|z_{i}-z_{i+1}\|_{P_{s}}^{2}+\|z_{i+1}-z_{*}\|_{P_{s}}^{2}+2\left\langle z_{i+1}-z_{*},P_{s}(z_{i}-z_{i+1})\right\rangle\\ &\ =\|\widetilde{\omega}_{i+1}\|_{P_{s}^{-1}}^{2}+\|z_{i+1}-z_{*}\|_{P_{s}}^{2}+2\left\langle z_{i+1}-z_{*},\widetilde{\omega}_{i+1}-0\right\rangle\\ &\ \geq\|z_{i+1}-z_{*}\|_{P_{s}}^{2}+\|\widetilde{\omega}_{i+1}\|_{P_{s}^{-1}}^{2}\ ,\end{split}

where the inequality comes from the monotonicity of the operator \mathcal{F} and the fact that 0\in\mathcal{F}(z_{*}). Thus, we have

dist_{P_{s}^{-1}}^{2}(0,\mathcal{F}(z_{i+1}))\leq\|\widetilde{\omega}_{i+1}\|_{P_{s}^{-1}}^{2}\leq\|z_{i}-z_{*}\|_{P_{s}}^{2}-\|z_{i+1}-z_{*}\|_{P_{s}}^{2}\ . (7)

Therefore, it follows that

\displaystyle\begin{split}dist_{P_{s}^{-1}}^{2}(0,\mathcal{F}(z_{k}))&\ \leq\frac{1}{k}\sum_{i=1}^{k}dist_{P_{s}^{-1}}^{2}(0,\mathcal{F}(z_{i}))\\ &\ \leq\frac{1}{k}\left({\sum_{i=1}^{k}\left({\|z_{i-1}-z_{*}\|_{P_{s}}^{2}-\|z_{i}-z_{*}\|_{P_{s}}^{2}}\right)}\right)\\ &\ \leq\frac{1}{k}\|z_{0}-z_{*}\|_{P_{s}}^{2}\ ,\end{split}

where the first inequality follows from Proposition 1 and the second inequality is due to inequality (7). ∎

As a direct consequence of Theorem 1, we obtain an \mathcal{O}(\frac{1}{\sqrt{k}}) convergence rate on the primal-dual gap for the last iterate of PDHG. The last iterate of PDHG thus has a slower guaranteed convergence rate than the average iterate, which has an \mathcal{O}(\frac{1}{k}) rate on the primal-dual gap [11]. This is consistent with the recent discovery that the last iterate has a slower convergence than the average iterate for smooth convex-concave saddle point problems for the extragradient method [30], Douglas-Rachford splitting [19, 18], and the primal-dual coordinate descent method [27].

Corollary 1.

Under the same assumptions and notation as Theorem 1, it holds for any iteration k\geq 1, any z=(x,y)\in\mathcal{Z} and any optimal solution z_{*} that

\mathcal{L}(x_{k},y)-\mathcal{L}(x,y_{k})\leq\frac{1}{\sqrt{k}}\left({\|z_{0}-z_{*}\|_{P_{s}}^{2}+\|z_{0}-z_{*}\|_{P_{s}}\|z_{*}-z\|_{P_{s}}}\right)\ .
Proof.

Denote u_{k}\in\partial f(x_{k}), v_{k}\in\partial g(y_{k}) and w_{k}=\begin{pmatrix}u_{k}+A^{T}y_{k}\\ v_{k}-Ax_{k}\end{pmatrix}\in\mathcal{F}(z_{k}). Then

\displaystyle\begin{split}\mathcal{L}(x_{k},y)-\mathcal{L}(x,y_{k})&\ =\mathcal{L}(x_{k},y)-\mathcal{L}(x_{k},y_{k})+\mathcal{L}(x_{k},y_{k})-\mathcal{L}(x,y_{k})\\ &\ \leq(-v_{k}+Ax_{k})^{T}(y-y_{k})+(u_{k}+A^{T}y_{k})^{T}(x_{k}-x)\\ &\ =\left\langle w_{k},z_{k}-z\right\rangle\leq\|w_{k}\|_{P_{s}^{-1}}\|z_{k}-z\|_{P_{s}}\ ,\end{split} (8)

where the first inequality uses the convexity of f and g and the last one follows from the Cauchy-Schwarz inequality. Since (8) holds for any w_{k}\in\mathcal{F}(z_{k}), we can choose \omega_{k}=\arg\min_{\omega\in\mathcal{F}(z_{k})}\{\|\omega\|_{P_{s}^{-1}}^{2}\}, so that \|\omega_{k}\|_{P_{s}^{-1}}^{2}=dist_{P_{s}^{-1}}^{2}(0,\mathcal{F}(z_{k})). Then it holds that

\displaystyle\begin{split}\mathcal{L}(x_{k},y)-\mathcal{L}(x,y_{k})&\ \leq dist_{P_{s}^{-1}}(0,\mathcal{F}(z_{k}))\|z_{k}-z\|_{P_{s}}\\ &\ \leq\sqrt{\frac{1}{k}}\|z_{0}-z_{*}\|_{P_{s}}\|z_{k}-z\|_{P_{s}}\\ &\ \leq\frac{1}{\sqrt{k}}\|z_{0}-z_{*}\|_{P_{s}}\left({\|z_{k}-z_{*}\|_{P_{s}}+\|z_{*}-z\|_{P_{s}}}\right)\\ &\ \leq\frac{1}{\sqrt{k}}\|z_{0}-z_{*}\|_{P_{s}}\left({\|z_{0}-z_{*}\|_{P_{s}}+\|z_{*}-z\|_{P_{s}}}\right)\ ,\end{split}

where the second inequality is due to Theorem 1, the third inequality follows from the triangle inequality, and the last inequality uses \|z_{k}-z_{*}\|_{P_{s}}\leq\|z_{0}-z_{*}\|_{P_{s}}. ∎

4 Linear convergence of IDS

In the previous sections, we showed that IDS monotonically decays along the iterates of PDHG and has a sublinear convergence rate for convex-concave minimax problems. In this section, we further show that IDS converges linearly if the minimax problem satisfies a regularity condition. This regularity condition holds for many applications, such as unconstrained bilinear problems, linear programming, strongly-convex-strongly-concave problems, etc.

We begin by introducing the following regularity condition.

Definition 2.

We say that the minimax problem (1), or equivalently the operator \mathcal{F}, satisfies metric sub-regularity with respect to the matrix P_{s} on a set \mathbb{S} if it holds for any z\in\mathbb{S} that

\alpha_{s}dist_{P_{s}}(z,\mathcal{Z}^{*})\leq dist_{P_{s}^{-1}}(0,\mathcal{F}(z))\ . (9)
Remark 3.

We comment that the metric sub-regularity condition defined above is a special case of the traditional definition of metric sub-regularity [21] with a set-valued mapping \mathcal{F}(z) and the vector 0.

Recently, metric sub-regularity and related sharpness conditions have been used to analyze primal-dual algorithms [26, 48, 1, 4, 60, 62]. In particular, it has been shown that strongly-convex-strongly-concave problems satisfy this condition globally, and that piecewise linear-quadratic functions satisfy it on a compact set [65, 42]. This includes many examples, such as LASSO, support vector machines, and linear and quadratic programming on a compact region in the primal-dual formulation. Suppose the step-size satisfies s\leq\frac{1}{2\|A\|}; then one can show that the unconstrained bilinear problem and generic linear programs satisfy the regularity condition (9). Below we list a few concrete examples that satisfy metric sub-regularity:

Example 1 (Unconstrained bilinear problem).

The unconstrained bilinear problem

\min_{x\in\mathbb{R}^{n}}\max_{y\in\mathbb{R}^{m}}c^{T}x+y^{T}Ax-b^{T}y (10)

satisfies global metric sub-regularity with \alpha_{s}=\frac{s}{2}\sigma_{min}^{+}(A) and \mathbb{S}=\mathbb{R}^{m+n}.

Example 2 (Linear programming).

For any optimal solution z_{*}, the primal-dual form of linear programming

\min_{x\in\mathbb{R}^{n}}\max_{y\in\mathbb{R}^{m}}c^{T}x+\iota_{\mathbb{R}_{+}^{n}}(x)+y^{T}Ax-b^{T}y (11)

satisfies the metric sub-regularity condition on the bounded region \mathbb{S}=\{z:\|z-z_{*}\|_{P_{s}}\leq\|z_{0}-z_{*}\|_{P_{s}}\} with \alpha_{s}=\frac{s}{2H(K)\sqrt{1+4R^{2}}}, where H(K) is the Hoffman constant of the KKT system of the LP and R=2\|z_{0}-z_{*}\|+\|z_{*}\|. Furthermore, the iterates of PDHG started from the initial solution z_{0} stay in the bounded region \mathbb{S}.

Example 3 (Strongly-convex-strongly-concave problem).

Consider the minimax problem

\min_{x\in\mathbb{R}^{n}}\max_{y\in\mathbb{R}^{m}}f(x)+\left\langle Ax,y\right\rangle-g(y) (12)

where f:\mathbb{R}^{n}\rightarrow\mathbb{R} and g:\mathbb{R}^{m}\rightarrow\mathbb{R} are \mu-strongly convex functions. Then, the problem satisfies the metric sub-regularity condition with \alpha_{s}=\frac{s\mu}{4}.

The proofs of these examples follow directly from [4, 26].

The next theorem presents the linear convergence of IDS for PDHG under metric sub-regularity:

Theorem 2.

Consider the iterates \{z_{k}\}_{k=0}^{\infty} of PDHG (Algorithm 1) for solving a convex-concave minimax problem (1). Suppose the step-size satisfies s<\frac{1}{\|A\|} and the minimax problem (1) satisfies the metric sub-regularity condition (9) on a set \mathbb{S} that contains \{z_{k}\}_{k=0}^{\infty}. Then, it holds for any iteration k\geq\left\lceil e/\alpha_{s}^{2}\right\rceil that

dist_{P_{s}^{-1}}^{2}(0,\mathcal{F}(z_{k}))\leq\exp\left({1-\frac{k}{\left\lceil e/\alpha_{s}^{2}\right\rceil}}\right)dist_{P_{s}^{-1}}^{2}(0,\mathcal{F}(z_{0}))\ .
Proof.

For any iteration k\geq 1, suppose c\left\lceil\frac{e}{\alpha_{s}^{2}}\right\rceil\leq k<(c+1)\left\lceil\frac{e}{\alpha_{s}^{2}}\right\rceil for a non-negative integer c. It follows from Theorem 1 that for any z_{*}\in\mathcal{Z}^{*}, we have

\displaystyle\begin{split}dist_{P_{s}^{-1}}^{2}(0,\mathcal{F}(z_{k}))&\ \leq\frac{1}{{\left\lceil e/\alpha_{s}^{2}\right\rceil}}\|z_{{k-\left\lceil e/\alpha_{s}^{2}\right\rceil}}-z_{*}\|_{P_{s}}^{2}\ .\end{split} (13)

Taking the minimum over z_{*}\in\mathcal{Z}^{*} on the RHS of (13), we obtain

\displaystyle\begin{split}dist_{P_{s}^{-1}}^{2}(0,\mathcal{F}(z_{k}))&\ \leq\frac{1}{{\left\lceil e/\alpha_{s}^{2}\right\rceil}}dist_{P_{s}}^{2}(z_{{k-\left\lceil e/\alpha_{s}^{2}\right\rceil}},\mathcal{Z}^{*})\ \leq\frac{1}{\alpha_{s}^{2}{\left\lceil e/\alpha_{s}^{2}\right\rceil}}dist_{P_{s}^{-1}}^{2}(0,\mathcal{F}(z_{{k-\left\lceil e/\alpha_{s}^{2}\right\rceil}}))\\ &\ \leq\frac{1}{{e}}dist_{P_{s}^{-1}}^{2}(0,\mathcal{F}(z_{{k-\left\lceil e/\alpha_{s}^{2}\right\rceil}}))\ ,\end{split} (14)

where the second inequality utilizes condition (9). Recursively applying (14), we arrive at

\displaystyle\begin{split}dist_{P_{s}^{-1}}^{2}(0,\mathcal{F}(z_{k}))&\ \leq\frac{1}{e}dist_{P_{s}^{-1}}^{2}(0,\mathcal{F}(z_{k-\left\lceil e/\alpha_{s}^{2}\right\rceil}))\leq\cdots\leq\left({\frac{1}{e}}\right)^{c}dist_{P_{s}^{-1}}^{2}(0,\mathcal{F}(z_{k-c\left\lceil e/\alpha_{s}^{2}\right\rceil}))\\ &\ \leq\exp(-c)dist_{P_{s}^{-1}}^{2}(0,\mathcal{F}(z_{0}))\\ &\ \leq\exp\left({1-\frac{k}{\left\lceil e/\alpha_{s}^{2}\right\rceil}}\right)dist_{P_{s}^{-1}}^{2}(0,\mathcal{F}(z_{0}))\ ,\end{split}

where the last two inequalities are due to the monotonicity of dist_{P_{s}^{-1}}^{2}(0,\mathcal{F}(z_{k})) (see Proposition 1) and c\left\lceil\frac{e}{\alpha_{s}^{2}}\right\rceil\leq k<(c+1)\left\lceil\frac{e}{\alpha_{s}^{2}}\right\rceil, respectively. ∎

Remark 4.

Theorem 2 implies that we need in total \mathcal{O}\left({\left({\frac{1}{\alpha_{s}}}\right)^{2}\log\left({\frac{1}{\epsilon}}\right)}\right) iterations to find an \epsilon-close solution z such that dist_{P_{s}^{-1}}^{2}(0,\mathcal{F}(z))\leq\epsilon.

Remark 5.

Similar to Corollary 1, it is straightforward to derive a linear convergence rate on the duality gap for the last iterate of PDHG under the metric sub-regularity condition from Theorem 2. In particular, it holds that

\displaystyle\begin{split}\mathcal{L}(x_{k},y)-\mathcal{L}(x,y_{k})\leq\exp\left({-\frac{k-\left\lceil e/\alpha_{s}^{2}\right\rceil}{2\left\lceil e/\alpha_{s}^{2}\right\rceil}}\right)dist_{P_{s}^{-1}}(0,\mathcal{F}(z_{0}))\left({dist_{P_{s}}(z_{0},\mathcal{Z}^{*})+\|z_{*}-z\|_{P_{s}}}\right)\ .\end{split}

5 Tightness

In this section, we demonstrate that the sublinear and linear rates derived for the last iterate of PDHG in the previous sections are tight, i.e., there exist instances on which the obtained convergence rates cannot be improved. The tightness herein does not refer to a lower bound result; that is, this argument does not rule out primal-dual first-order algorithms that achieve even better performance in the decay of IDS.

5.1 Linear convergence

To begin with, Theorem 3 presents a convex-concave instance that satisfies the metric sub-regularity condition (9) on which the last iterate of PDHG requires at least \Omega\left({\left({\frac{1}{\alpha_{s}}}\right)^{2}\log\frac{1}{\epsilon}}\right) iterations to identify a solution such that dist_{P_{s}^{-1}}^{2}(0,\mathcal{F}(z))\leq\epsilon. Combined with Theorem 2, we conclude that for problems satisfying condition (9) the linear rate derived in Theorem 2 is tight up to a constant.

Theorem 3.

For any iteration k\geq 1 and step-size s\leq\frac{1}{2\|A\|}, there exist an initial point z_{0} and a convex-concave minimax problem (1) satisfying the metric sub-regularity condition (9) with \alpha_{s}<\frac{1}{2}, such that the PDHG iterates \{z_{k}\} satisfy

dist_{P_{s}^{-1}}^{2}(0,\mathcal{F}(z_{k}))\geq\frac{1}{12}\left({1-4\alpha_{s}^{2}}\right)^{k}dist_{P_{s}^{-1}}^{2}(0,\mathcal{F}(z_{0}))\ . (15)

To construct tight instances, we utilize the following intermediate result from [4, Section 5.2]:

Proposition 2.

Consider a 2-dimensional vector sequence \{a_{k}\} such that

\displaystyle\begin{split}a_{k+1}=\begin{pmatrix}1-2s^{2}\sigma^{2}&-s\sigma\\ s\sigma&1\end{pmatrix}a_{k}\ ,\end{split}

with \sigma>0 and s\sigma<1/2. Then, it holds for any k\geq 0 that

\left\|a_{k}\right\|_{2}^{2}\geq\frac{1}{3}(1-s^{2}\sigma^{2})^{k}\left\|a_{0}\right\|_{2}^{2}\ .
Proof of Theorem 3.

Let f(x)=g(y)=0, x\in\mathbb{R}^{m}, y\in\mathbb{R}^{m} and A=\text{diag}(\sigma_{1},...,\sigma_{m})\in\mathbb{R}^{m\times m} with 0<\sigma_{1}\leq...\leq\sigma_{m}. Set the initial point z_{0}=e_{1}=(1,0,\ldots,0)\in\mathbb{R}^{2m} to be the first standard basis vector. Then, we have

dist_{P_{s}^{-1}}^{2}(0,\mathcal{F}(z_{0}))=\|F(z_{0})\|_{P_{s}^{-1}}^{2}\geq\frac{s}{2}\|F(z_{0})\|_{2}^{2}>0\ ,

where the inequality uses P_{s}^{-1}\succeq\frac{s}{2}I. Furthermore, following from (4), the PDHG update can be rewritten as

\begin{pmatrix}I&-sA^{T}\\ -sA&I\end{pmatrix}z_{k+1}=\begin{pmatrix}I&-sA^{T}\\ -sA&I\end{pmatrix}z_{k}-s\begin{pmatrix}0&A^{T}\\ -A&0\end{pmatrix}z_{k+1}\ ,

which is equivalent to the update rule

z_{k+1}=\begin{pmatrix}I&0\\ -2sA&I\end{pmatrix}^{-1}\begin{pmatrix}I&-sA^{T}\\ -sA&I\end{pmatrix}z_{k}=\begin{pmatrix}I&-sA^{T}\\ sA&I-2s^{2}AA^{T}\end{pmatrix}z_{k}\ , (16)

thus it holds that

\displaystyle\begin{split}F(z_{k+1})=\begin{pmatrix}0&A^{T}\\ -A&0\end{pmatrix}z_{k+1}&\ =\begin{pmatrix}0&A^{T}\\ -A&0\end{pmatrix}\begin{pmatrix}I&-sA^{T}\\ sA&I-2s^{2}AA^{T}\end{pmatrix}z_{k}\\ &\ =\begin{pmatrix}0&A^{T}\\ -A&0\end{pmatrix}\begin{pmatrix}I&-sA^{T}\\ sA&I-2s^{2}AA^{T}\end{pmatrix}\begin{pmatrix}0&-A^{-1}\\ A^{-T}&0\end{pmatrix}\begin{pmatrix}0&A^{T}\\ -A&0\end{pmatrix}z_{k}\\ &\ =\begin{pmatrix}I-2s^{2}A^{T}A&-sA^{T}\\ sA&I\end{pmatrix}\begin{pmatrix}0&A^{T}\\ -A&0\end{pmatrix}z_{k}=\begin{pmatrix}I-2s^{2}A^{T}A&-sA^{T}\\ sA&I\end{pmatrix}F(z_{k})\ ,\end{split}

where in the second equality we plug in the update rule (16) and the third equality uses the non-singularity of the matrix A.

Notice that A is a diagonal matrix; thus we have

\displaystyle\begin{split}\|F(z_{k})\|_{P_{s}^{-1}}^{2}&\ \geq\frac{s}{2}\|F(z_{k})\|_{2}^{2}\\ &\ \geq\frac{s}{2}\left({(F(z_{k})_{1})^{2}+(F(z_{k})_{m+1})^{2}}\right)\\ &\ \geq\frac{s}{6}(1-s^{2}\sigma_{1}^{2})^{k}\left({(F(z_{0})_{1})^{2}+(F(z_{0})_{m+1})^{2}}\right)\\ &\ =\frac{s}{6}(1-s^{2}\sigma_{1}^{2})^{k}\|F(z_{0})\|_{2}^{2}\\ &\ \geq\frac{1}{12}(1-s^{2}\sigma_{1}^{2})^{k}\|F(z_{0})\|_{P_{s}^{-1}}^{2}\ ,\end{split} (17)

where F(z)_{i} denotes the i-th coordinate of the gradient vector F(z). The first and last inequalities are due to the choice of step-size s such that \frac{s}{2}I\preceq P_{s}^{-1}\preceq 2sI, and the third inequality follows from Proposition 2 and the fact that A is diagonal.

We finish the proof by noticing that for this instance \alpha_{s}=\frac{s}{2}\sigma_{min}^{+}(A)=\frac{s}{2}\sigma_{1} (see Example 1). ∎
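
The lower bound (15) can be checked numerically on the instance constructed in this proof; the sketch below (our own illustration with assumed sizes and singular values) runs PDHG on the diagonal bilinear problem and verifies the bound along the trajectory.

```python
import numpy as np

# Numerical check of the lower bound (15) on the diagonal bilinear instance
# built in this proof (sizes and singular values are assumptions for illustration).
m = 4
sigma = np.array([0.1, 1.0, 1.0, 1.0])        # sigma_1 = 0.1, sigma_m = 1.0
A = np.diag(sigma)
s = 1.0 / (2.0 * sigma.max())                  # step-size s <= 1/(2||A||)
alpha = s * sigma.min() / 2.0                  # alpha_s = (s/2) * sigma_min^+(A)

Ps = np.block([[np.eye(m) / s, -A.T], [-A, np.eye(m) / s]])
Ps_inv = np.linalg.inv(Ps)
F = lambda x, y: np.concatenate([A.T @ y, -A @ x])
ids = lambda x, y: F(x, y) @ Ps_inv @ F(x, y)

x = np.zeros(m)
x[0] = 1.0                                     # z_0 = e_1
y = np.zeros(m)
ids0 = ids(x, y)
holds = True
for k in range(1, 201):
    x_new = x - s * (A.T @ y)                  # PDHG with f = g = 0
    y = y + s * (A @ (2 * x_new - x))
    x = x_new
    holds = holds and ids(x, y) >= (1 - 4 * alpha ** 2) ** k * ids0 / 12.0
print("lower bound (15) holds along the run:", holds)
```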

5.2 Sublinear convergence

Next, Theorem 4 presents a convex-concave instance on which PDHG requires at least \Omega\left({\frac{1}{\epsilon}}\right) iterations to achieve a solution such that dist_{P_{s}^{-1}}^{2}(0,\mathcal{F}(z))\leq\epsilon. Combined with Theorem 1, we conclude that the sublinear rate derived in Theorem 1 is tight up to a constant.

Theorem 4.

For any iteration k\geq 1 and step-size s=\frac{1}{c\|A\|} with c\geq 2, there exist a convex-concave minimax problem (1) and an initial point z_{0} such that the PDHG iterates \{z_{k}\} satisfy

dist_{P_{s}^{-1}}^{2}(0,\mathcal{F}(z_{k}))\geq\frac{1}{48c^{2}e^{4/c^{2}}}\frac{\|z_{0}-z_{*}\|_{P_{s}}^{2}}{k}\ . (18)
Proof.

To prove Theorem 4, we utilize the same instance constructed in Theorem 3 with carefully chosen singular values of the matrix A. In particular, we set \sigma_{1}=\frac{L_{A}}{\sqrt{k}} and \sigma_{2}=...=\sigma_{m}=L_{A}. Then we have \|A\|=L_{A} and \alpha_{s}=\frac{\sigma_{1}}{c\sigma_{m}}=\frac{1}{c\sqrt{k}}.

It follows from Theorem 3 that

dist_{P_{s}^{-1}}^{2}(0,\mathcal{F}(z_{k}))=\|F(z_{k})\|_{P_{s}^{-1}}^{2}\geq\frac{1}{12}\left({1-4\alpha_{s}^{2}}\right)^{k}\|F(z_{0})\|_{P_{s}^{-1}}^{2}\geq\frac{s}{24}\left({1-\frac{4}{c^{2}k}}\right)^{k}\|F(z_{0})\|_{2}^{2}\ ,

where the first inequality is from Theorem 3 and the last inequality uses P_{s}^{-1}\succeq\frac{s}{2}I. Furthermore, notice that

\|F(z_{0})\|_{2}^{2}=\left\|\begin{pmatrix}0&A^{T}\\ -A&0\end{pmatrix}z_{0}\right\|_{2}^{2}=z_{0}^{T}\begin{pmatrix}A^{T}A&0\\ 0&AA^{T}\end{pmatrix}z_{0}\geq\sigma_{1}^{2}\|z_{0}\|_{2}^{2}\ ,

it thus holds that

\displaystyle\begin{split}dist_{P_{s}^{-1}}^{2}(0,\mathcal{F}(z_{k}))\geq\frac{s}{24}\left({1-\frac{4}{c^{2}k}}\right)^{k}\left(\frac{L_{A}}{\sqrt{k}}\right)^{2}\|z_{0}\|_{2}^{2}=\frac{s}{24}\left({1-\frac{4}{c^{2}k}}\right)^{k}\frac{L_{A}^{2}}{k}\|z_{0}-z_{*}\|_{2}^{2}\ .\end{split} (19)

Meanwhile, it holds that

\displaystyle\begin{split}\frac{s}{24}\left({1-\frac{4}{c^{2}k}}\right)^{k}\frac{L_{A}^{2}}{k}\|z_{0}-z_{*}\|_{2}^{2}\ \geq\frac{s}{24e^{4/c^{2}}}\frac{L_{A}^{2}\|z_{0}-z_{*}\|_{2}^{2}}{k}\ \geq\frac{s^{2}}{48e^{4/c^{2}}}\frac{L_{A}^{2}\|z_{0}-z_{*}\|_{P_{s}}^{2}}{k}\ =\frac{1}{48c^{2}e^{4/c^{2}}}\frac{\|z_{0}-z_{*}\|_{P_{s}}^{2}}{k}\ ,\end{split} (20)

where the first inequality uses \left({1-\frac{4}{c^{2}k}}\right)^{k}\geq e^{-4/c^{2}}, the second inequality is from P_{s}\preceq\frac{2}{s}I, and the last equality follows from the choice of step-size s=\frac{1}{c\|A\|}\leq\frac{1}{2\|A\|}. We finish the proof by combining (19) and (20). ∎

6 Efficient Computation and Numerical Results

In this section, we present a linear-time algorithm to compute IDS for a general problem (1), and present preliminary numerical experiments on linear programming that validate our theoretical results.

6.1 Linear-time computation of IDS

Computing IDS (2) involves projecting 0 onto the sub-differential set \mathcal{F}(z) with respect to the norm \|\cdot\|_{P_{s}^{-1}}, which may not always be trivial. Here, we present a principled approach to efficiently compute IDS in linear time using Nesterov’s accelerated gradient descent (AGD), under the assumption that the non-smooth part of \mathcal{F} is simple. In our numerical experiments, usually 15 iterations of AGD suffice to compute IDS to high accuracy.

Consider the convex-concave minimax problem (1). The sub-differential at a solution z is given by

\mathcal{F}(z)=\mathcal{F}(x,y)=\begin{pmatrix}A^{T}y\\ -Ax\end{pmatrix}+\begin{pmatrix}\partial f(x)\\ \partial g(y)\end{pmatrix}\ .

We denote d(z):=\begin{pmatrix}A^{T}y\\ -Ax\end{pmatrix} and \mathcal{G}(z):=\begin{pmatrix}\partial f(x)\\ \partial g(y)\end{pmatrix}; then the projection problem can be formulated as the following quadratic program:

\displaystyle\begin{split}\min_{\omega\in\mathcal{G}}\left\|\omega+d\right\|_{P_{s}^{-1}}^{2}=\omega^{T}P_{s}^{-1}\omega+2d^{T}P_{s}^{-1}\omega+d^{T}P_{s}^{-1}d\ .\end{split} (21)

For most applications of PDHG, the sub-gradient set \mathcal{G}(z) as well as the projection onto \mathcal{G}(z) are straightforward to compute. For example, when f(x)=\iota_{\mathcal{C}}(x) is the indicator function of a closed convex set \mathcal{C}, the subdifferential is its normal cone, that is, \partial f(x)=N_{\mathcal{C}}(x). If f(x)=\|x\|_{1} is the \ell_{1} norm, then \partial f(x)=\{g:\|g\|_{\infty}\leq 1,g^{T}x=\|x\|_{1}\}.
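
For illustration, the two Euclidean projections mentioned above can be written in a few lines of NumPy (our own hedged sketch; the tolerance parameter is an assumption):

```python
import numpy as np

def proj_normal_cone_nonneg(v, x, tol=1e-12):
    """Euclidean projection of v onto the normal cone of R^n_+ at x >= 0:
    entries with x_i > 0 are forced to 0, entries with x_i = 0 are clipped to be <= 0."""
    return np.where(x > tol, 0.0, np.minimum(v, 0.0))

def proj_l1_subdiff(v, x, tol=1e-12):
    """Euclidean projection of v onto the subdifferential of ||.||_1 at x:
    sign(x_i) where x_i != 0, and clip(v_i, -1, 1) where x_i = 0."""
    return np.where(np.abs(x) > tol, np.sign(x), np.clip(v, -1.0, 1.0))
```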

Notice that (21) is a convex quadratic program. A natural algorithm for solving the QP (21) is Nesterov’s accelerated gradient descent, which we formally present in Algorithm 2. Recall that throughout this section, we assume the step-size s satisfies s\leq\frac{1}{2\|A\|}. Thus, we have \frac{s}{2}I\preceq P_{s}^{-1}\preceq 2sI and the condition number of P_{s}^{-1} is \kappa(P_{s}^{-1})\leq 4. As a result, this QP is easy to solve and the complexity of AGD for solving it is \mathcal{O}\left({\log\frac{1}{\epsilon}}\right) [52]. Since this iteration count is independent of the dimension of the problem, we obtain a linear-time computation of IDS. In our numerical experiments, we found that it usually takes fewer than 15 iterations of Nesterov’s accelerated method to solve the subproblem (21) almost exactly, which further verifies the effectiveness of AGD.

Algorithm 2 Accelerated Gradient Descent for solving (21)
0:  Initial point w_{0} (set u_{0}=w_{0}), step-size \beta, condition number \kappa.
1:  for k=0,1,... do
2:     w_{k+1}=\text{Proj}_{\mathcal{G}}\left({u_{k}-\frac{2}{\beta}P_{s}^{-1}(u_{k}+d)}\right)
3:     u_{k+1}=w_{k+1}+\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}(w_{k+1}-w_{k})
4:  end for
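
The following NumPy sketch (our own illustration, specialized to the LP form \min_{x\geq 0}\max_{y}c^{T}x+y^{T}Ax-b^{T}y and using a dense P_{s} only for readability) puts Algorithm 2 together: it builds d(z), projects onto \mathcal{G}(z)=(c+N_{\mathbb{R}_{+}^{n}}(x))\times\{b\}, and runs the accelerated iteration to return the IDS value.

```python
import numpy as np

def ids_via_agd(A, b, c, x, y, s, iters=15):
    """Sketch of Algorithm 2 for the LP saddle point min_{x>=0} max_y c^T x + y^T A x - b^T y:
    project 0 onto F(z) in the P_s^{-1} norm, i.e. solve (21) with
    G(z) = (c + N_{R^n_+}(x)) x {b} and d(z) = (A^T y, -A x).
    A dense P_s is used only for readability; at scale one would rely on matrix-vector products."""
    m, n = A.shape
    Ps = np.block([[np.eye(n) / s, -A.T], [-A, np.eye(m) / s]])
    Ps_inv = np.linalg.inv(Ps)
    eigs = np.linalg.eigvalsh(Ps_inv)
    beta = 2.0 * eigs[-1]                                  # smoothness constant of (21)
    kappa = eigs[-1] / eigs[0]                             # condition number of P_s^{-1}
    mom = (np.sqrt(kappa) - 1.0) / (np.sqrt(kappa) + 1.0)  # momentum coefficient

    d = np.concatenate([A.T @ y, -A @ x])

    def proj_G(w):
        # Euclidean projection onto G(z): primal block c + N_{R^n_+}(x), dual block {b}
        wx = c + np.where(x > 1e-12, 0.0, np.minimum(w[:n] - c, 0.0))
        return np.concatenate([wx, b])

    w = proj_G(np.zeros(n + m))
    u = w.copy()
    for _ in range(iters):
        w_new = proj_G(u - (2.0 / beta) * (Ps_inv @ (u + d)))
        u = w_new + mom * (w_new - w)
        w = w_new
    return (w + d) @ Ps_inv @ (w + d)                      # IDS value at z = (x, y)
```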

6.2 Numerical experiments on PDHG for LP

Here, we present preliminary numerical results on PDHG for LP that verify our theoretical results in the previous sections.

Datasets. We use the root-node LP relaxations of three instances from MIPLIB: enlight_hard, supportcase14 and nexp-50-20-4-2. The dimensions (i.e., n and m) of the three instances are summarized in Table 1. These instances are chosen such that vanilla PDHG converges to an optimal solution within a reasonable number of iterations; similar results hold for other instances.

Implementation details. We run PDHG (Algorithm 1) on the three instances from MIPLIB with step-size s=\frac{1}{2\|A\|}. We calculate IDS using AGD (Algorithm 2). We terminate Algorithm 2 when the first-order optimality condition of (21) holds with tolerance 10^{-10} (this effectively means we solve the subproblem (21) up to machine precision). We report the average number of AGD iterations in Table 1.

Results. Figure 2 presents IDS and the KKT residual versus the number of PDHG iterations for the three LP instances. To put them on the same scale, we square the KKT residual (recall that the definition of IDS also involves a square). As we can see, IDS monotonically decays, which verifies Proposition 1. In contrast, there can be huge fluctuations in the KKT residual, which showcases that IDS is a more natural progress metric for PDHG. Furthermore, the decay of IDS follows a linear rate, which verifies our theoretical results in Theorem 2.

Instance           Size             Average iterations of AGD
enlight_hard       100\times 200    12.6
supportcase14      234\times 304    15.0
nexp-50-20-4-2     540\times 1225   13.04
Table 1: Test instance sizes and average iteration counts of Algorithm 2
Figure 2: Infimal sub-differential size vs. iterations for (a) enlight_hard, (b) supportcase14, and (c) nexp-50-20-4-2.

7 Extensions to Other Algorithms

The results on IDS in Sections 2-4 are not limited to PDHG. In this section, we showcase how to extend them to three classic algorithms: the proximal point method (PPM), the alternating direction method of multipliers (ADMM) and linearized ADMM (l-ADMM).

The cornerstone of the analysis in Sections 2-4 is to identify a positive definite matrix \mathcal{P} such that the update rule for the iterate z_{k+1} has the form

\mathcal{P}(z_{k}-z_{k+1})\in\mathcal{F}(z_{k+1})\ . (22)

This is the only fact about PDHG that we used in our analysis. Indeed, if an algorithm can be formulated as (22), we can define IDS as

dist_{\mathcal{P}^{-1}}^{2}(0,\mathcal{F}(z))\ , (23)

and the corresponding monotonicity (Proposition 1) and complexity results for IDS (Theorem 1 and Theorem 2) follow from exactly the same analysis as for PDHG.

In the rest of this section, we show how to reformulate the iterate updates of PPM, ADMM and l-ADMM as instances of (22), and present the corresponding results for the three algorithms. A mild difference between ADMM and the other algorithms is that the matrix \mathcal{P} in ADMM is not full rank, so its inverse is not well-defined. We overcome this by defining IDS along the subspace on which \mathcal{P} is full rank.

We also remark that the positive semidefinite matrix \mathcal{P} differs across algorithms. For example, \mathcal{P}=\begin{pmatrix}\frac{1}{s}I_{n}&-A^{T}\\ -A&\frac{1}{s}I_{m}\end{pmatrix} in PDHG, \mathcal{P}=\frac{1}{s}I in PPM, \mathcal{P}=\begin{pmatrix}\frac{1}{\tau}I&-A^{T}\\ -A&\tau AA^{T}\end{pmatrix} in ADMM and \mathcal{P}=\begin{pmatrix}\frac{1}{\tau}I&-A^{T}\\ -A&\lambda I\end{pmatrix} in l-ADMM. Consequently, the distance in the definition of IDS is chosen differently for each algorithm, and the trajectories of different algorithms may follow different ellipsoids specified by the matrix \mathcal{P} in Figure 1. A small sketch constructing these matrices is given below.
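
For concreteness, the sketch below (our own illustration) constructs the matrix \mathcal{P} listed above for each algorithm, given the data matrix A and the corresponding step-size parameters.

```python
import numpy as np

def P_pdhg(A, s):
    """P for PDHG: [[I/s, -A^T], [-A, I/s]]."""
    m, n = A.shape
    return np.block([[np.eye(n) / s, -A.T], [-A, np.eye(m) / s]])

def P_ppm(dim, s):
    """P for PPM: (1/s) * I."""
    return np.eye(dim) / s

def P_admm(A, tau):
    """P for ADMM: [[I/tau, -A^T], [-A, tau*A*A^T]] (positive semidefinite, not full rank)."""
    m, n = A.shape
    return np.block([[np.eye(n) / tau, -A.T], [-A, tau * A @ A.T]])

def P_ladmm(A, tau, lam):
    """P for linearized ADMM: [[I/tau, -A^T], [-A, lambda*I]]."""
    m, n = A.shape
    return np.block([[np.eye(n) / tau, -A.T], [-A, lam * np.eye(m)]])
```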

7.1 Proximal Point Method (PPM)

Consider a convex-concave minimax problem:

\min_{x}\max_{y}\mathcal{L}(x,y)\ ,   (24)

where $\mathcal{L}(x,y)$ is a potentially non-differentiable function that is convex in $x$ and concave in $y$. The proximal point method (PPM) [55] (Algorithm 3) is a classic algorithm for solving (24), and some other classic algorithms, such as the extra-gradient method [51], can be viewed as approximations of PPM.

Algorithm 3 Proximal Point Method for (24)
0:  Initial point $(x_{0},y_{0})$, step-size $s>0$.
1:  for $k=0,1,\dots$ do
2:     $(x_{k+1},y_{k+1})=\arg\min_{x}\arg\max_{y}\left\{\mathcal{L}(x,y)+\frac{1}{2s}\|x-x_{k}\|_{2}^{2}-\frac{1}{2s}\|y-y_{k}\|_{2}^{2}\right\}$
3:  end for

Denote $\mathcal{F}(z)=\begin{pmatrix}\partial_{x}\mathcal{L}(x,y)\\ -\partial_{y}\mathcal{L}(x,y)\end{pmatrix}$. We can rewrite the update rule of PPM as follows:

\frac{1}{s}(z_{k}-z_{k+1})\in\mathcal{F}(z_{k+1})\ .

Parallel to the results for PDHG stated in Sections 2-4, we obtain:

Theorem 5.

Consider the iterates $\{z_{k}\}_{k=0}^{\infty}$ of PPM (Algorithm 3) for solving a convex-concave minimax problem (24). Then it holds for $k\geq 1$ that:

(i) monotonic decay

dist^{2}\left({0,\mathcal{F}(z_{k+1})}\right)\leq dist^{2}\left({0,\mathcal{F}(z_{k})}\right)\ .

(ii) sublinear convergence

dist^{2}(0,\mathcal{F}(z_{k}))\leq\frac{1}{k}\|z_{0}-z_{*}\|_{2}^{2}\ .

(iii) Furthermore, assume the minimax problem (24) satisfies the metric sub-regularity condition (9) under the $\ell_{2}$ norm on a set $\mathbb{S}$ that contains $\{z_{k}\}_{k=0}^{\infty}$, that is, for some constant $\alpha>0$,

\alpha\, dist(z,\mathcal{Z}^{*})\leq dist(0,\mathcal{F}(z))\ .

Then, it holds for any iteration $k\geq\left\lceil e/\alpha^{2}\right\rceil$ that

dist^{2}(0,\mathcal{F}(z_{k}))\leq\exp\left({1-\frac{k}{\left\lceil e/\alpha^{2}\right\rceil}}\right)dist^{2}(0,\mathcal{F}(z_{0}))\ .
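As a quick numerical illustration of Theorem 5 (a minimal sketch of our own, not an experiment from the paper), consider the purely bilinear problem $\mathcal{L}(x,y)=\left\langle Ax,y\right\rangle$, for which the PPM subproblem reduces to a linear system and $\mathcal{F}(z)=(A^{T}y,-Ax)$ is single-valued:

import numpy as np

rng = np.random.default_rng(2)
n, m = 6, 4
A = rng.standard_normal((m, n))
s = 0.5

# PPM step for L(x, y) = <Ax, y>: the subproblem's optimality conditions are
#   x + s A^T y = x_k,   -s A x + y = y_k,   i.e. a single linear solve.
M = np.block([[np.eye(n), s * A.T], [-s * A, np.eye(m)]])

def F(z):
    x, y = z[:n], z[n:]
    return np.concatenate([A.T @ y, -A @ x])       # F(z) = (grad_x L, -grad_y L)

z = rng.standard_normal(n + m)
prev = np.inf
for k in range(50):
    z_new = np.linalg.solve(M, z)
    assert np.allclose((z - z_new) / s, F(z_new))  # the inclusion (here, an equation) above
    res = float(np.linalg.norm(F(z_new)) ** 2)     # dist^2(0, F(z)) for single-valued F
    assert res <= prev + 1e-12                     # monotone decay, Theorem 5 (i)
    prev, z = res, z_new
print("final squared residual:", prev)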

7.2 Linearized ADMM

Consider a minimization problem of the form:

\min_{x}\;f(x)+g(Ax)\ ,   (25)

where $f$ and $g$ are convex but potentially non-smooth functions. Examples of (25) include LASSO [12, 57], SVM [6, 15], regularized logistic regression [39], image recovery [10, 7], etc. The primal-dual formulation of (25) is

\min_{x\in\mathbb{R}^{n}}\max_{y\in\mathbb{R}^{m}}\mathcal{L}(x,y)=f(x)+\left\langle Ax,y\right\rangle-g^{*}(y)\ ,   (26)

where $g^{*}$ is the conjugate function of $g$. Linearized ADMM (l-ADMM) [56] is a variant of ADMM that avoids solving linear equations. Algorithm 4 presents l-ADMM for solving (25).

Algorithm 4 Linearized ADMM
0:  Initial point $(w_{0},v_{0},y_{0})$, step-sizes $\lambda,\tau>0$, and conjugate functions $f^{*}$, $g^{*}$.
1:  for $k=0,1,\dots$ do
2:     $w_{k+1}=\arg\min_{w}\left\{f^{*}(w)-\langle w,v_{k}-\tau(A^{T}y_{k}+w_{k})\rangle+\frac{\tau}{2}\|w-w_{k}\|_{2}^{2}\right\}$
3:     $v_{k+1}=v_{k}-\tau(A^{T}y_{k}+w_{k+1})$
4:     $y_{k+1}=\arg\min_{y}\left\{g^{*}(y)-\langle A^{T}y,v_{k+1}-\tau(A^{T}y_{k}+w_{k+1})\rangle+\frac{\lambda}{2}\|y-y_{k}\|_{2}^{2}\right\}$
5:  end for

As a first step, we identify the appropriate differential-inclusion form of the update rule; that is, we find a positive (semi-)definite matrix $\mathcal{P}$ such that the update of the algorithm can be rewritten as

\mathcal{P}(z_{k}-z_{k+1})\in\mathcal{F}(z_{k+1})\ ,

where $\mathcal{F}(z)=\begin{pmatrix}\partial f(x)+A^{T}y\\ \partial g^{*}(y)-Ax\end{pmatrix}$.

Notice that

\begin{split}v_{k+1}&\ =v_{k}-\tau A^{T}y_{k}-\tau w_{k+1}\\ &\ =(v_{k}-\tau A^{T}y_{k})-\tau\mathrm{prox}_{f^{*}}^{\tau}\left({\frac{1}{\tau}(v_{k}-\tau A^{T}y_{k})}\right)\\ &\ =(v_{k}-\tau A^{T}y_{k})-(v_{k}-\tau A^{T}y_{k})+\mathrm{prox}_{f}^{\frac{1}{\tau}}(v_{k}-\tau A^{T}y_{k})\\ &\ =\mathrm{prox}_{f}^{\frac{1}{\tau}}(v_{k}-\tau A^{T}y_{k})\ ,\end{split}   (27)

where the second equality uses the update $w_{k+1}=\mathrm{prox}_{f^{*}}^{\tau}\left({\frac{1}{\tau}(v_{k}-\tau A^{T}y_{k})}\right)$ and the third equality follows from the Moreau decomposition of the proximal operator, $\mathrm{prox}_{f^{*}}^{\tau}(t)=t-\frac{1}{\tau}\mathrm{prox}_{f}^{\frac{1}{\tau}}(\tau t)$.
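This identity can be checked numerically under the convention $\mathrm{prox}_{h}^{\lambda}(t)=\arg\min_{w}\{h(w)+\frac{\lambda}{2}\|w-t\|_{2}^{2}\}$; the following is a minimal sketch of our own for $f=\|\cdot\|_{1}$, where $\mathrm{prox}_{f}^{1/\tau}$ is soft-thresholding at level $\tau$ and $\mathrm{prox}_{f^{*}}^{\tau}$ is the projection onto the unit $\ell_{\infty}$ ball.

import numpy as np

def prox_f(t, tau):
    # prox_f^{1/tau}(t) for f = ||.||_1: soft-thresholding at level tau
    return np.sign(t) * np.maximum(np.abs(t) - tau, 0.0)

def prox_fstar(t, tau):
    # prox_{f*}^{tau}(t) for f* = indicator of the unit ell_inf ball: projection (tau-independent)
    return np.clip(t, -1.0, 1.0)

rng = np.random.default_rng(3)
t, tau = 3.0 * rng.standard_normal(1000), 0.7
# the identity used in the third equality above
assert np.allclose(prox_fstar(t, tau), t - prox_f(tau * t, tau) / tau)
print("Moreau decomposition verified for f = ||.||_1")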

Moreover, it holds that

\begin{split}y_{k+1}&\ =\arg\min_{y}\left\{g^{*}(y)-\langle A^{T}(y-y_{k}),v_{k+1}-\tau(w_{k+1}+A^{T}y_{k})\rangle+\frac{\lambda}{2}\|y-y_{k}\|_{2}^{2}\right\}\\ &\ =\arg\min_{y}\left\{g^{*}(y)-\langle y-y_{k},A(2v_{k+1}-v_{k})\rangle+\frac{\lambda}{2}\|y-y_{k}\|_{2}^{2}\right\}\\ &\ =\mathrm{prox}_{g^{*}}^{\lambda}\left({y_{k}+\frac{1}{\lambda}A(2v_{k+1}-v_{k})}\right)\ .\end{split}   (28)

Denote $\mathcal{P}=\begin{pmatrix}\frac{1}{\tau}I&-A^{T}\\ -A&\lambda I\end{pmatrix}$ and $z=(v,y)$. Rearranging (27) and (28), we arrive at

\begin{split}\mathcal{P}(z_{k}-z_{k+1})&\ =\begin{pmatrix}\frac{1}{\tau}I_{n}&-A^{T}\\ -A&\lambda I_{m}\end{pmatrix}\begin{pmatrix}v_{k}-v_{k+1}\\ y_{k}-y_{k+1}\end{pmatrix}=\begin{pmatrix}\frac{1}{\tau}(v_{k}-v_{k+1})-A^{T}(y_{k}-y_{k+1})\\ -A(v_{k}-v_{k+1})+\lambda(y_{k}-y_{k+1})\end{pmatrix}\\ &\ \in\begin{pmatrix}\partial f(v_{k+1})+A^{T}y_{k+1}\\ -Av_{k+1}+\partial g^{*}(y_{k+1})\end{pmatrix}=\mathcal{F}(z_{k+1})\ .\end{split}
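Below is a minimal numerical check of this reformulation (and a preview of the monotone decay claimed in Theorem 6 below), instantiated with the smooth, purely illustrative choices $f(x)=\frac{1}{2}\|x-c\|_{2}^{2}$ and $g(u)=\frac{1}{2}\|u-b\|_{2}^{2}$, so that $g^{*}(y)=\frac{1}{2}\|y\|_{2}^{2}+\left\langle b,y\right\rangle$ and both proximal maps are affine. The data are random and serve only this sanity check.

import numpy as np

rng = np.random.default_rng(4)
m, n = 4, 6
A = rng.standard_normal((m, n))
b, c = rng.standard_normal(m), rng.standard_normal(n)

tau = 0.2 / np.linalg.norm(A, 2) ** 2
lam = 2.0 * tau * np.linalg.norm(A, 2) ** 2        # lambda > tau ||A||^2, so P is positive definite
P = np.block([[np.eye(n) / tau, -A.T], [-A, lam * np.eye(m)]])

def F(v, y):
    # F(z) for f(x) = 0.5 ||x - c||^2 and g*(y) = 0.5 ||y||^2 + <b, y>
    return np.concatenate([v - c + A.T @ y, y + b - A @ v])

v, y = np.zeros(n), np.zeros(m)
prev = np.inf
for k in range(200):
    v_new = (tau * c + v - tau * A.T @ y) / (tau + 1.0)                 # (27): prox_f^{1/tau}
    y_new = (lam * y + A @ (2.0 * v_new - v) - b) / (lam + 1.0)         # (28): prox_{g*}^{lambda}
    z, z_new = np.concatenate([v, y]), np.concatenate([v_new, y_new])
    assert np.allclose(P @ (z - z_new), F(v_new, y_new))                # the inclusion (an equation here)
    ids = float(F(v_new, y_new) @ np.linalg.solve(P, F(v_new, y_new)))  # dist_{P^{-1}}^2(0, F(z))
    assert ids <= prev + 1e-9                                           # monotone decay, Theorem 6 (i)
    prev, v, y = ids, v_new, y_new
print("final IDS:", prev)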

Then, utilizing exactly the same IDS analysis that we developed for PDHG in Sections 2-4, we reach the following results for linearized ADMM:

Theorem 6.

Consider the iterates $\{(v_{k},y_{k})\}_{k=0}^{\infty}$ of linearized ADMM (Algorithm 4) for solving problem (25). Suppose $\lambda>\tau\|A\|^{2}$ and denote $\mathcal{P}=\begin{pmatrix}\frac{1}{\tau}I&-A^{T}\\ -A&\lambda I\end{pmatrix}$. Then it holds for $k\geq 1$ that:

(i) monotonic decay

\mathrm{dist}^{2}_{\mathcal{P}^{-1}}(0,\mathcal{F}(v_{k+1},y_{k+1}))\leq\mathrm{dist}^{2}_{\mathcal{P}^{-1}}(0,\mathcal{F}(v_{k},y_{k}))

(ii) sublinear convergence

dist_{\mathcal{P}^{-1}}^{2}(0,\mathcal{F}(v_{k},y_{k}))\leq\frac{1}{k}\|(v_{0},y_{0})-(v_{*},y_{*})\|_{\mathcal{P}}^{2}\ .

(iii) Furthermore, assume the minimax problem (26) satisfies the metric sub-regularity condition (9) for $\mathcal{P}$ on a set $\mathbb{S}$ that contains $\{(v_{k},y_{k})\}_{k=0}^{\infty}$, that is, for some constant $\alpha_{\mathcal{P}}>0$,

\alpha_{\mathcal{P}}\, dist_{\mathcal{P}}((v,y),\mathcal{F}^{-1}(0))\leq dist_{\mathcal{P}^{-1}}(0,\mathcal{F}(v,y))\ .

Then, it holds for any iteration $k\geq\left\lceil e/\alpha_{\mathcal{P}}^{2}\right\rceil$ that

dist_{\mathcal{P}^{-1}}^{2}(0,\mathcal{F}(v_{k},y_{k}))\leq\exp\left({1-\frac{k}{\left\lceil e/\alpha_{\mathcal{P}}^{2}\right\rceil}}\right)dist_{\mathcal{P}^{-1}}^{2}(0,\mathcal{F}(v_{0},y_{0}))\ .

7.3 Alternating Direction Method of Multipliers (ADMM)

Algorithm 5 ADMM
0:  Initial point $(v_{0},y_{0})$, step-size $\tau>0$.
1:  for $k=0,1,\dots$ do
2:     $w_{k+1}=\arg\min_{w}\left\{f^{*}(w)-\langle A^{T}y_{k}+w,v_{k}\rangle+\frac{\tau}{2}\|w+A^{T}y_{k}\|_{2}^{2}\right\}$
3:     $v_{k+1}=v_{k}-\tau(A^{T}y_{k}+w_{k+1})$
4:     $y_{k+1}=\arg\min_{y}\left\{g^{*}(y)-\langle A^{T}y+w_{k+1},v_{k+1}\rangle+\frac{\tau}{2}\|A^{T}y+w_{k+1}\|_{2}^{2}\right\}$
5:  end for

In this subsection, we study the properties of IDS for vanilla ADMM when solving (25). Similar to the previous sections, we first rewrite the update rule of ADMM as an instance of (22).

Notice that

\begin{split}v_{k+1}&\ =v_{k}-\tau A^{T}y_{k}-\tau w_{k+1}\\ &\ =(v_{k}-\tau A^{T}y_{k})-\tau\mathrm{prox}_{f^{*}}^{\tau}\left({\frac{1}{\tau}(v_{k}-\tau A^{T}y_{k})}\right)\\ &\ =(v_{k}-\tau A^{T}y_{k})-(v_{k}-\tau A^{T}y_{k})+\mathrm{prox}_{f}^{\frac{1}{\tau}}(v_{k}-\tau A^{T}y_{k})\\ &\ =\mathrm{prox}_{f}^{\frac{1}{\tau}}(v_{k}-\tau A^{T}y_{k})\ ,\end{split}   (29)

where the second equality uses the update $w_{k+1}=\mathrm{prox}_{f^{*}}^{\tau}\left({\frac{1}{\tau}(v_{k}-\tau A^{T}y_{k})}\right)$ and the third equality follows from the Moreau decomposition of the proximal operator, $\mathrm{prox}_{f^{*}}^{\tau}(t)=t-\frac{1}{\tau}\mathrm{prox}_{f}^{\frac{1}{\tau}}(\tau t)$.

Furthermore, it holds that

\begin{split}y_{k+1}&\ =\arg\min_{y}\left\{g^{*}(y)-\langle A^{T}y+w_{k+1},v_{k+1}\rangle+\frac{\tau}{2}\|A^{T}y+w_{k+1}\|_{2}^{2}\right\}\\ &\ =\arg\min_{y}\left\{g^{*}(y)+\frac{\tau}{2}\|A^{T}y+w_{k+1}-\frac{1}{\tau}v_{k+1}\|_{2}^{2}\right\}\\ &\ =\arg\min_{y}\left\{g^{*}(y)+\frac{\tau}{2}\|A^{T}y-A^{T}y_{k}-\frac{1}{\tau}(2v_{k+1}-v_{k})\|_{2}^{2}\right\}\ .\end{split}   (30)

Denote $\mathcal{P}=\begin{pmatrix}\frac{1}{\tau}I&-A^{T}\\ -A&\tau AA^{T}\end{pmatrix}$ and $z=(v,y)$. Rearranging (29) and (30), we arrive at

\begin{split}\mathcal{P}(z_{k}-z_{k+1})&\ =\begin{pmatrix}\frac{1}{\tau}I_{n}&-A^{T}\\ -A&\tau AA^{T}\end{pmatrix}\begin{pmatrix}v_{k}-v_{k+1}\\ y_{k}-y_{k+1}\end{pmatrix}=\begin{pmatrix}\frac{1}{\tau}(v_{k}-v_{k+1})-A^{T}(y_{k}-y_{k+1})\\ -A(v_{k}-v_{k+1})+\tau AA^{T}(y_{k}-y_{k+1})\end{pmatrix}\\ &\ \in\begin{pmatrix}\partial f(v_{k+1})+A^{T}y_{k+1}\\ -Av_{k+1}+\partial g^{*}(y_{k+1})\end{pmatrix}=\mathcal{F}(z_{k+1})\ .\end{split}
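The same style of sanity check can be run for vanilla ADMM (again a sketch of our own, with the illustrative smooth choices $f(x)=\frac{1}{2}\|x-c\|_{2}^{2}$ and $g(u)=\frac{1}{2}\|u-b\|_{2}^{2}$); here the $y$-update (30) becomes a small linear solve, and the matrix $\mathcal{P}$ is visibly rank-deficient.

import numpy as np

rng = np.random.default_rng(5)
m, n = 4, 6
A = rng.standard_normal((m, n))
b, c = rng.standard_normal(m), rng.standard_normal(n)
tau = 0.5

P = np.block([[np.eye(n) / tau, -A.T], [-A, tau * A @ A.T]])   # positive semidefinite, rank n

def F(v, y):
    # F(z) for f(x) = 0.5 ||x - c||^2 and g*(y) = 0.5 ||y||^2 + <b, y>
    return np.concatenate([v - c + A.T @ y, y + b - A @ v])

v, y = np.zeros(n), np.zeros(m)
for k in range(100):
    v_new = (tau * c + v - tau * A.T @ y) / (tau + 1.0)          # (29): prox_f^{1/tau}
    # (30): minimize g*(y) + (tau/2) ||A^T y - A^T y_k - (2 v_new - v)/tau||^2, a linear solve here
    y_new = np.linalg.solve(np.eye(m) + tau * A @ A.T,
                            tau * A @ A.T @ y + A @ (2.0 * v_new - v) - b)
    z, z_new = np.concatenate([v, y]), np.concatenate([v_new, y_new])
    assert np.allclose(P @ (z - z_new), F(v_new, y_new))         # the inclusion (an equation here)
    v, y = v_new, y_new
print("rank of P:", np.linalg.matrix_rank(P), "of", n + m)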

A key difference between ADMM and the other methods we present is that $\mathcal{P}$ is not invertible; thus $\mathcal{P}^{-1}$ in the definition of IDS is not well-defined. Fortunately, this can be overcome by redefining IDS along the non-singular subspace of $\mathcal{P}$ as follows:

Definition 3.

The infimal sub-differential size (IDS) at a solution $z$ for an algorithm with update rule $\mathcal{P}(z_{k}-z_{k+1})\in\mathcal{F}(z_{k+1})$ for solving the convex-concave minimax problem (1) is defined as:

dist_{\mathcal{P}^{-}}^{2}(0,\mathcal{F}(z)\cap\mathrm{range}(\mathcal{P}))\ ,   (31)

where $\mathcal{P}^{-}$ is the pseudo-inverse of $\mathcal{P}$, $\mathrm{range}(\mathcal{P})$ is the range of the matrix $\mathcal{P}$, and $\mathcal{F}(z)=\begin{pmatrix}\partial f(x)+A^{T}y\\ -Ax+\partial g(y)\end{pmatrix}$ is the sub-differential of the objective $\mathcal{L}(x,y)$.

Notice that (31) is still a valid progress metric of the solution: $dist^{2}_{\mathcal{P}^{-}}(0,\mathcal{F}(z_{k})\cap\mathrm{range}(\mathcal{P}))=0$ implies that $0\in\mathcal{F}(z_{k+1})$. More formally, if IDS equals zero at $z_{k}$, it holds that $\|z_{k}-z_{k+1}\|_{\mathcal{P}}^{2}=0$ and thus $0=\mathcal{P}(z_{k}-z_{k+1})\in\mathcal{F}(z_{k+1})$.
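In code, evaluating (31) at the particular element $\widetilde{\omega}_{k+1}=\mathcal{P}(z_{k}-z_{k+1})$, which lies in $\mathrm{range}(\mathcal{P})$ by construction, can be done with a pseudo-inverse. A minimal sketch (the helper name is ours):

import numpy as np

def ids_pseudo(P, w):
    # squared P^- norm of a single element w of F(z) ∩ range(P), cf. (31)
    P_pinv = np.linalg.pinv(P)
    # sanity check: w lies in range(P); P P^- is the orthogonal projector onto range(P)
    assert np.allclose(P @ P_pinv @ w, w, atol=1e-8 * (1.0 + np.linalg.norm(w)))
    return float(w @ P_pinv @ w)

For instance, with $w=\mathcal{P}(z_{k}-z_{k+1})$ from the ADMM sketch above, ids_pseudo(P, w) equals $\|z_{k}-z_{k+1}\|_{\mathcal{P}}^{2}$.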

All the results developed above hold for the refined IDS for ADMM with a similar proof; we present the differences in the proof in Appendix A:

Theorem 7.

Consider the iterates $\{(v_{k},y_{k})\}_{k=0}^{\infty}$ of ADMM (Algorithm 5) for solving problem (25). Denote $\mathcal{P}=\begin{pmatrix}\frac{1}{\tau}I&-A^{T}\\ -A&\tau AA^{T}\end{pmatrix}$. Then it holds for $k\geq 1$ that:

(i) monotonic decay

\mathrm{dist}^{2}_{\mathcal{P}^{-}}(0,\mathcal{F}(v_{k+1},y_{k+1})\cap\mathrm{range}(\mathcal{P}))\leq\mathrm{dist}^{2}_{\mathcal{P}^{-}}(0,\mathcal{F}(v_{k},y_{k})\cap\mathrm{range}(\mathcal{P}))

(ii) sublinear convergence

dist_{\mathcal{P}^{-}}^{2}(0,\mathcal{F}(v_{k},y_{k})\cap\mathrm{range}(\mathcal{P}))\leq\frac{1}{k}\|(v_{0},y_{0})-(v_{*},y_{*})\|_{\mathcal{P}}^{2}\ .

(iii) Furthermore, assume the minimax problem (26) satisfies the metric sub-regularity condition (9) for $\mathcal{P}$ on a set $\mathbb{S}$ that contains $\{(v_{k},y_{k})\}_{k=0}^{\infty}$, that is, for some constant $\alpha_{\mathcal{P}}>0$,

\alpha_{\mathcal{P}}\, dist_{\mathcal{P}}((v,y),\mathcal{F}^{-1}(0))\leq dist_{\mathcal{P}^{-}}(0,\mathcal{F}(v,y))\ .

Then, it holds for any iteration $k\geq\left\lceil e/\alpha_{\mathcal{P}}^{2}\right\rceil$ that

dist_{\mathcal{P}^{-}}^{2}(0,\mathcal{F}(v_{k},y_{k})\cap\mathrm{range}(\mathcal{P}))\leq\exp\left({1-\frac{k}{\left\lceil e/\alpha_{\mathcal{P}}^{2}\right\rceil}}\right)dist_{\mathcal{P}^{-}}^{2}(0,\mathcal{F}(v_{0},y_{0})\cap\mathrm{range}(\mathcal{P}))\ .

8 Conclusions and future directions

In this paper, we introduce a new progress metric for analyzing PDHG for convex-concave minimax problems, which we call the infimal sub-differential size (IDS). IDS is a natural generalization of the gradient norm to non-smooth problems. Compared to other progress metrics, IDS is always finite, easy to compute, and monotonically decays along the iterates of PDHG. We further show sublinear and linear convergence rates of IDS. The monotonicity and the simplicity of the analysis suggest that IDS is the right progress metric for analyzing PDHG. All of the results we obtain here can be directly extended to PPM, ADMM and l-ADMM. A natural future direction is to investigate whether the properties of IDS hold for other primal-dual algorithms. Another is whether IDS can suggest good step-sizes and/or preconditioners for primal-dual algorithms.

Acknowledgement

The authors would like to thank David Applegate, Oliver Hinder, Miles Lubin and Warren Schudy for the helpful discussions that motivated the authors to identify a monotonically decaying progress metric for PDLP.

References

  • [1] Ahmet Alacaoglu, Olivier Fercoq, and Volkan Cevher, On the convergence of stochastic primal-dual hybrid gradient, arXiv preprint arXiv:1911.00799 (2019).
  • [2] David Applegate, Mateo Díaz, Oliver Hinder, Haihao Lu, Miles Lubin, Brendan O’Donoghue, and Warren Schudy, Practical large-scale linear programming using primal-dual hybrid gradient, Advances in Neural Information Processing Systems 34 (2021).
  • [3] David Applegate, Mateo Díaz, Haihao Lu, and Miles Lubin, Infeasibility detection with primal-dual hybrid gradient for large-scale linear programming, arXiv preprint arXiv:2102.04592 (2021).
  • [4] David Applegate, Oliver Hinder, Haihao Lu, and Miles Lubin, Faster first-order primal-dual methods for linear programming using restarts and sharpness, arXiv preprint arXiv:2105.12715 (2021).
  • [5] Dimitris Bertsimas and John N Tsitsiklis, Introduction to linear optimization, vol. 6, Athena Scientific Belmont, MA, 1997.
  • [6] Bernhard E Boser, Isabelle M Guyon, and Vladimir N Vapnik, A training algorithm for optimal margin classifiers, Proceedings of the fifth annual workshop on Computational learning theory, 1992, pp. 144–152.
  • [7] Kristian Bredies and Martin Holler, A tgv-based framework for variational image decompression, zooming, and reconstruction. part i: Analytics, SIAM Journal on Imaging Sciences 8 (2015), no. 4, 2814–2850.
  • [8] Yang Cai, Argyris Oikonomou, and Weiqiang Zheng, Tight last-iterate convergence of the extragradient method for constrained monotone variational inequalities, arXiv preprint arXiv:2204.09228 (2022).
  • [9] Antonin Chambolle, Matthias J Ehrhardt, Peter Richtárik, and Carola-Bibiane Schonlieb, Stochastic primal-dual hybrid gradient algorithm with arbitrary sampling and imaging applications, SIAM Journal on Optimization 28 (2018), no. 4, 2783–2808.
  • [10] Antonin Chambolle and Thomas Pock, A first-order primal-dual algorithm for convex problems with applications to imaging, Journal of mathematical imaging and vision 40 (2011), no. 1, 120–145.
  • [11]  , On the ergodic convergence rates of a first-order primal–dual algorithm, Mathematical Programming 159 (2016), no. 1-2, 253–287.
  • [12] Shaobing Chen and David Donoho, Basis pursuit, Proceedings of 1994 28th Asilomar Conference on Signals, Systems and Computers, vol. 1, IEEE, 1994, pp. 41–44.
  • [13] Patrick L Combettes, Laurent Condat, J-C Pesquet, and BC Vũ, A forward-backward view of some primal-dual optimization methods in image recovery, 2014 IEEE International Conference on Image Processing (ICIP), IEEE, 2014, pp. 4141–4145.
  • [14] Laurent Condat, A primal–dual splitting method for convex optimization involving lipschitzian, proximable and linear composite terms, Journal of optimization theory and applications 158 (2013), no. 2, 460–479.
  • [15] Corinna Cortes and Vladimir Vapnik, Support-vector networks, Machine learning 20 (1995), no. 3, 273–297.
  • [16] George B Dantzig, Linear programming, Operations research 50 (2002), no. 1, 42–47.
  • [17] Constantinos Daskalakis, Andrew Ilyas, Vasilis Syrgkanis, and Haoyang Zeng, Training GANs with optimism, International Conference on Learning Representations, 2018.
  • [18] Damek Davis and Wotao Yin, Convergence rate analysis of several splitting schemes, Splitting methods in communication, imaging, science, and engineering, Springer, 2016, pp. 115–163.
  • [19]  , Faster convergence rates of relaxed peaceman-rachford and admm under regularity assumptions, Mathematics of Operations Research 42 (2017), no. 3, 783–805.
  • [20] Asen L Dontchev and R Tyrrell Rockafellar, Regularity and conditioning of solution mappings in variational analysis, Set-Valued Analysis 12 (2004), no. 1, 79–109.
  • [21]  , Implicit functions and solution mappings, vol. 543, Springer, 2009.
  • [22] Jim Douglas and Henry H Rachford, On the numerical solution of heat conduction problems in two and three space variables, Transactions of the American mathematical Society 82 (1956), no. 2, 421–439.
  • [23] Dmitriy Drusvyatskiy and Adrian S Lewis, Error bounds, quadratic growth, and linear convergence of proximal methods, Mathematics of Operations Research 43 (2018), no. 3, 919–948.
  • [24] Jonathan Eckstein and Dimitri P Bertsekas, On the Douglas—Rachford splitting method and the proximal point algorithm for maximal monotone operators, Mathematical Programming 55 (1992), no. 1-3, 293–318.
  • [25] Ernie Esser, Xiaoqun Zhang, and Tony F Chan, A general framework for a class of first order primal-dual algorithms for convex optimization in imaging science, SIAM Journal on Imaging Sciences 3 (2010), no. 4, 1015–1046.
  • [26] Olivier Fercoq, Quadratic error bound of the smoothed gap and the restarted averaged primal-dual hybrid gradient, (2021).
  • [27] Olivier Fercoq and Pascal Bianchi, A coordinate-descent primal-dual algorithm with large step size and possibly nonseparable functions, SIAM Journal on Optimization 29 (2019), no. 1, 100–134.
  • [28] Guy Gilboa, Michael Moeller, and Martin Burger, Nonlinear spectral analysis via one-homogeneous functionals: overview and future prospects, Journal of Mathematical Imaging and Vision 56 (2016), no. 2, 300–319.
  • [29] Tom Goldstein, Min Li, and Xiaoming Yuan, Adaptive primal-dual splitting methods for statistical learning and image processing, Advances in Neural Information Processing Systems, 2015, pp. 2089–2097.
  • [30] Noah Golowich, Sarath Pattathil, Constantinos Daskalakis, and Asuman Ozdaglar, Last iterate is slower than averaged iterate in smooth convex-concave saddle point problems, Conference on Learning Theory, PMLR, 2020, pp. 1758–1784.
  • [31] Eduard Gorbunov, Nicolas Loizou, and Gauthier Gidel, Extragradient method: O(1/k){O}(1/k) last-iterate convergence for monotone variational inequalities and connections with cocoercivity, arXiv preprint arXiv:2110.04261 (2021).
  • [32] Eduard Gorbunov, Adrien Taylor, and Gauthier Gidel, Last-iterate convergence of optimistic gradient method for monotone variational inequalities, arXiv preprint arXiv:2205.08446 (2022).
  • [33] Benjamin Grimmer, Haihao Lu, Pratik Worah, and Vahab Mirrokni, The landscape of the proximal point method for nonconvex-nonconcave minimax optimization, arXiv preprint arXiv:2006.08667 (2020).
  • [34] Bingsheng He and Xiaoming Yuan, Convergence analysis of primal-dual algorithms for a saddle-point problem: from contraction perspective, SIAM Journal on Imaging Sciences 5 (2012), no. 1, 119–149.
  • [35] René Henrion, Abderrahim Jourani, and Jiri Outrata, On the calmness of a class of multifunctions, SIAM Journal on Optimization 13 (2002), no. 2, 603–618.
  • [36] AD Ioffe, Necessary and sufficient conditions for a local minimum. 1: A reduction theorem and first order conditions, SIAM Journal on Control and Optimization 17 (1979), no. 2, 245–250.
  • [37] Alexander D Ioffe and Jiří V Outrata, On metric and calmness qualification conditions in subdifferential calculus, Set-Valued Analysis 16 (2008), no. 2, 199–227.
  • [38] Florian Knoll, Martin Holler, Thomas Koesters, Ricardo Otazo, Kristian Bredies, and Daniel K Sodickson, Joint mr-pet reconstruction using a multi-channel image regularizer, IEEE transactions on medical imaging 36 (2016), no. 1, 1–16.
  • [39] Kwangmoo Koh, Seung-Jean Kim, and Stephen Boyd, An interior-point method for large-scale l1-regularized logistic regression, Journal of Machine learning research 8 (2007), no. Jul, 1519–1555.
  • [40] Galina M Korpelevich, The extragradient method for finding saddle points and other problems, Matecon 12 (1976), 747–756.
  • [41] Alexander Y Kruger, Error bounds and metric subregularity, Optimization 64 (2015), no. 1, 49–79.
  • [42] Puya Latafat, Nikolaos M Freris, and Panagiotis Patrinos, A new randomized block-coordinate primal-dual proximal algorithm for distributed optimization, IEEE Transactions on Automatic Control 64 (2019), no. 10, 4050–4065.
  • [43] Sucheol Lee and Donghwan Kim, Fast extra gradient methods for smooth structured nonconvex-nonconcave minimax problems, Advances in Neural Information Processing Systems 34 (2021), 22588–22600.
  • [44] Jingwei Liang, Jalal Fadili, and Gabriel Peyré, Convergence rates with inexact non-expansive operators, Mathematical Programming 159 (2016), no. 1, 403–434.
  • [45] Yanli Liu, Yunbei Xu, and Wotao Yin, Acceleration of primal–dual methods by preconditioning and simple subproblem procedures, Journal of Scientific Computing 86 (2021), no. 2, 1–34.
  • [46] Haihao Lu, An O(sr){O}(s^{r})-resolution ODE framework for discrete-time optimization algorithms and applications to convex-concave saddle-point problems, arXiv preprint arXiv:2001.08826 (2020).
  • [47] Haihao Lu and Jinwen Yang, Nearly optimal linear convergence of stochastic primal-dual methods for linear programming, arXiv preprint arXiv:2111.05530 (2021).
  • [48] Meng Lu and Zheng Qu, An adaptive proximal point algorithm framework and application to large-scale optimization, arXiv preprint arXiv:2008.08784 (2020).
  • [49] Yura Malitsky and Thomas Pock, A first-order primal-dual algorithm with linesearch, SIAM Journal on Optimization 28 (2018), no. 1, 411–432.
  • [50] Aryan Mokhtari, Asuman Ozdaglar, and Sarath Pattathil, A unified analysis of extra-gradient and optimistic gradient methods for saddle point problems: Proximal point approach, International Conference on Artificial Intelligence and Statistics, 2020.
  • [51] Arkadi Nemirovski, Prox-method with rate of convergence O(1/t) for variational inequalities with lipschitz continuous monotone operators and smooth convex-concave saddle point problems, SIAM Journal on Optimization 15 (2004), no. 1, 229–251.
  • [52] Yurii Nesterov, Introductory lectures on convex optimization: A basic course, vol. 87, Springer Science & Business Media, 2003.
  • [53] Daniel O’Connor and Lieven Vandenberghe, On the equivalence of the primal-dual hybrid gradient method and douglas–rachford splitting, Mathematical Programming 179 (2020), no. 1, 85–108.
  • [54] Thomas Pock and Antonin Chambolle, Diagonal preconditioning for first order primal-dual algorithms in convex optimization, 2011 International Conference on Computer Vision, IEEE, 2011, pp. 1762–1769.
  • [55] R. Tyrrell Rockafellar, Monotone operators and the proximal point algorithm, SIAM Journal on Control and Optimization 14 (1976), no. 5, 877–898.
  • [56] Ernest K Ryu and Wotao Yin, Large-scale convex optimization: Algorithms & analyses via monotone operators, Cambridge University Press, 2022.
  • [57] Robert Tibshirani, Regression shrinkage and selection via the lasso, Journal of the Royal Statistical Society: Series B (Methodological) 58 (1996), no. 1, 267–288.
  • [58] Paul Tseng, On linear convergence of iterative methods for the variational inequality problem, Journal of Computational and Applied Mathematics 60 (1995), no. 1-2, 237–252.
  • [59] Maria-Luiza Vladarean, Yura Malitsky, and Volkan Cevher, A first-order primal-dual method with adaptivity to local smoothness, Advances in Neural Information Processing Systems 34 (2021), 6171–6182.
  • [60] Wei Hong Yang and Deren Han, Linear convergence of the alternating direction method of multipliers for a class of convex optimization problems, SIAM journal on Numerical Analysis 54 (2016), no. 2, 625–640.
  • [61] TaeHo Yoon and Ernest K Ryu, Accelerated algorithms for smooth convex-concave minimax problems with O(1/k2){O}(1/{k^{2}}) rate on squared gradient norm, International Conference on Machine Learning, PMLR, 2021, pp. 12098–12109.
  • [62] Xiaoming Yuan, Shangzhi Zeng, and Jin Zhang, Discerning the linear convergence of admm for structured convex optimization through the lens of variational analysis., J. Mach. Learn. Res. 21 (2020), 83–1.
  • [63] Xi Yin Zheng and Kung Fu Ng, Metric subregularity and constraint qualifications for convex generalized equations in banach spaces, SIAM Journal on Optimization 18 (2007), no. 2, 437–460.
  • [64] Xi Yin Zheng and Kung Fu Ng, Metric subregularity and calmness for nonconvex generalized equations in banach spaces, SIAM Journal on Optimization 20 (2010), no. 5, 2119–2136.
  • [65]  , Metric subregularity of piecewise linear multifunctions and applications to piecewise linear multiobjective optimization, SIAM Journal on Optimization 24 (2014), no. 1, 154–174.
  • [66] Mingqiang Zhu and Tony Chan, An efficient primal-dual hybrid gradient algorithm for total variation image restoration, UCLA Cam Report 34 (2008), 8–34.

Appendix A Analysis for ADMM

In this section, we consider algorithms whose update rule has the following form: for some positive semi-definite matrix $\mathcal{P}$,

\mathcal{P}(z_{k}-z_{k+1})\in\mathcal{F}(z_{k+1})\ .   (32)

The previous results for IDS require the matrix $\mathcal{P}$ to be positive definite. Here we extend the notion of IDS to handle a positive semi-definite but not necessarily full-rank $\mathcal{P}$. We start with the following definition of IDS; note that for positive definite $\mathcal{P}$, it reduces to Definition 1.

Definition 4.

The infimal sub-differential size (IDS) at a solution $z$ for an algorithm with update rule (32) for solving the convex-concave minimax problem (1) is defined as:

dist_{\mathcal{P}^{-}}^{2}(0,\mathcal{F}(z)\cap\mathrm{range}(\mathcal{P}))\ ,

where $\mathcal{P}^{-}$ is the pseudo-inverse of $\mathcal{P}$, $\mathrm{range}(\mathcal{P})$ is the range of the matrix $\mathcal{P}$, and $\mathcal{F}(z)=\begin{pmatrix}\partial f(x)+A^{T}y\\ -Ax+\partial g(y)\end{pmatrix}$ is the sub-differential of the objective $\mathcal{L}(x,y)$.

There are two major differences in dealing with the case of singular $\mathcal{P}$. First, the pseudo-inverse of $\mathcal{P}$ is used, since the inverse $\mathcal{P}^{-1}$ is not well-defined. Furthermore, in contrast to $\mathcal{F}(z)$ in Definition 1, here we consider $\mathcal{F}(z)\cap\mathrm{range}(\mathcal{P})$, because $\mathcal{P}$ is positive definite along the subspace $\mathrm{range}(\mathcal{P})$ but not necessarily on the whole space.

IDS under Definition 4 is still a valid metric for optimality: if $dist^{2}_{\mathcal{P}^{-}}(0,\mathcal{F}(z_{k})\cap\mathrm{range}(\mathcal{P}))$ is small, then $z_{k+1}$ is a good approximation of an optimal solution. Moreover, it retains the desirable properties: monotonic decay, sublinear convergence, and a linear rate under metric sub-regularity. The results are summarized in the following propositions.

Proposition 3.

Consider an algorithm with update rule (32) for solving a convex-concave minimax problem (1). It holds for $k\geq 0$ that

dist_{\mathcal{P}^{-}}^{2}\left({0,\mathcal{F}(z_{k+1})\cap\mathrm{range}(\mathcal{P})}\right)\leq dist_{\mathcal{P}^{-}}^{2}\left({0,\mathcal{F}(z_{k})\cap\mathrm{range}(\mathcal{P})}\right)\ .
Proof.

The proof is similar to that of Proposition 1. First, notice that the iterate update is given by

\mathcal{P}(z_{k}-z_{k+1})\in\mathcal{F}(z_{k+1})\ ,

and it is also obvious that

\mathcal{P}(z_{k}-z_{k+1})\in\mathrm{range}(\mathcal{P})\ .

Hence we have

\mathcal{P}(z_{k}-z_{k+1})\in\mathcal{F}(z_{k+1})\cap\mathrm{range}(\mathcal{P})\ .

Now denote $\widetilde{\omega}_{k+1}=\mathcal{P}(z_{k}-z_{k+1})\in\mathcal{F}(z_{k+1})\cap\mathrm{range}(\mathcal{P})$; then it holds for any $\omega_{k}\in\mathcal{F}(z_{k})\cap\mathrm{range}(\mathcal{P})$ that

\begin{split}\|\widetilde{\omega}_{k+1}\|_{\mathcal{P}^{-}}^{2}-\|\omega_{k}\|_{\mathcal{P}^{-}}^{2}&=\left\langle\widetilde{\omega}_{k+1}-\omega_{k},\mathcal{P}^{-}(\widetilde{\omega}_{k+1}+\omega_{k})\right\rangle\\ &\leq\left\langle\widetilde{\omega}_{k+1}-\omega_{k},2(z_{k+1}-z_{k})+\mathcal{P}^{-}(\widetilde{\omega}_{k+1}+\omega_{k})\right\rangle\\ &=\left\langle\mathcal{P}(z_{k}-z_{k+1})-\omega_{k},2(z_{k+1}-z_{k})+\mathcal{P}^{-}(\mathcal{P}(z_{k}-z_{k+1})+\omega_{k})\right\rangle\\ &=\left\langle\mathcal{P}(z_{k}-z_{k+1})-\omega_{k},(-2I+\mathcal{P}^{-}\mathcal{P})(z_{k}-z_{k+1})+\mathcal{P}^{-}\omega_{k}\right\rangle\\ &=-\|z_{k}-z_{k+1}\|_{\mathcal{P}}^{2}+2\omega_{k}^{T}(z_{k}-z_{k+1})-\|\omega_{k}\|_{\mathcal{P}^{-}}^{2}\\ &=-\|z_{k}-z_{k+1}\|_{\mathcal{P}}^{2}+2\omega_{k}^{T}\mathcal{P}^{-}\mathcal{P}(z_{k}-z_{k+1})-\|\omega_{k}\|_{\mathcal{P}^{-}}^{2}\\ &=-\|\mathcal{P}(z_{k}-z_{k+1})-\omega_{k}\|_{\mathcal{P}^{-}}^{2}\\ &=-\|\widetilde{\omega}_{k+1}-\omega_{k}\|_{\mathcal{P}^{-}}^{2}\\ &\leq 0\ ,\end{split}

where the inequality utilizes the fact that $\mathcal{F}$ is a monotone operator, due to the convexity-concavity of the objective $\mathcal{L}(x,y)$, so that $\left\langle\widetilde{\omega}_{k+1}-\omega_{k},z_{k+1}-z_{k}\right\rangle\geq 0$. The second equality uses $\widetilde{\omega}_{k+1}=\mathcal{P}(z_{k}-z_{k+1})$, and the fifth equality follows from $\omega_{k}\in\mathcal{F}(z_{k})\cap\mathrm{range}(\mathcal{P})$, which implies $\omega_{k}=\mathcal{P}^{-}\mathcal{P}\omega_{k}$.
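The two pseudo-inverse identities used here, $\omega=\mathcal{P}^{-}\mathcal{P}\omega$ for $\omega\in\mathrm{range}(\mathcal{P})$ and $\|\mathcal{P}u\|_{\mathcal{P}^{-}}^{2}=\|u\|_{\mathcal{P}}^{2}$, are easy to check numerically; the following minimal sketch of our own uses the singular ADMM matrix as an example.

import numpy as np

rng = np.random.default_rng(6)
m, n = 4, 6
A = rng.standard_normal((m, n))
tau = 0.5
P = np.block([[np.eye(n) / tau, -A.T], [-A, tau * A @ A.T]])   # the singular ADMM matrix

P_pinv = np.linalg.pinv(P)
u = rng.standard_normal(n + m)
omega = P @ u                                                  # an arbitrary element of range(P)
assert np.allclose(P_pinv @ P @ omega, omega)                  # P^- P projects onto range(P)
assert np.allclose(omega @ P_pinv @ omega, u @ P @ u)          # ||P u||_{P^-}^2 = ||u||_P^2
print("pseudo-inverse identities verified")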

Now choose $\omega_{k}=\arg\min_{w\in\mathcal{F}(z_{k})\cap\mathrm{range}(\mathcal{P})}\{\|w\|_{\mathcal{P}^{-}}^{2}\}$, and we obtain

dist_{\mathcal{P}^{-}}^{2}(0,\mathcal{F}(z_{k+1})\cap\mathrm{range}(\mathcal{P}))\leq\|\widetilde{\omega}_{k+1}\|_{\mathcal{P}^{-}}^{2}\leq\|\omega_{k}\|_{\mathcal{P}^{-}}^{2}=dist_{\mathcal{P}^{-}}^{2}(0,\mathcal{F}(z_{k})\cap\mathrm{range}(\mathcal{P}))\ ,

which finishes the proof. ∎

Given the monotonic decay, sublinear convergence follows, and under metric sub-regularity in the $\mathcal{P}$-norm, IDS converges linearly. The proofs are identical to those of Theorem 1 and Theorem 2, respectively.

Proposition 4.

Consider the iterates $\{z_{k}\}_{k=0}^{\infty}$ of an algorithm with update rule (32) for solving a convex-concave minimax problem (1). Let $z_{*}\in\mathcal{Z}^{*}$ be an optimal point of (1). Then, it holds for any iteration $k\geq 1$ that

dist_{\mathcal{P}^{-}}^{2}(0,\mathcal{F}(z_{k})\cap\mathrm{range}(\mathcal{P}))\leq\frac{1}{k}\|z_{0}-z_{*}\|_{\mathcal{P}}^{2}\ .
Proposition 5.

Consider the iterates $\{z_{k}\}_{k=0}^{\infty}$ of an algorithm with update rule (32) for solving a convex-concave minimax problem (1). Suppose the minimax problem (1) satisfies the metric sub-regularity condition (9) on a set $\mathbb{S}$ that contains $\{z_{k}\}_{k=0}^{\infty}$, that is,

\alpha_{\mathcal{P}}\, dist_{\mathcal{P}}(z,\mathcal{Z}^{*})\leq dist_{\mathcal{P}^{-}}(0,\mathcal{F}(z))\ .

Then, it holds for any iteration $k\geq\left\lceil e/\alpha_{\mathcal{P}}^{2}\right\rceil$ that

dist_{\mathcal{P}^{-}}^{2}(0,\mathcal{F}(z_{k})\cap\mathrm{range}(\mathcal{P}))\leq\exp\left({1-\frac{k}{\left\lceil e/\alpha_{\mathcal{P}}^{2}\right\rceil}}\right)dist_{\mathcal{P}^{-}}^{2}(0,\mathcal{F}(z_{0})\cap\mathrm{range}(\mathcal{P}))\ .